Preface



This volume contains the proceedings of CHARME 2003, the 12th Advanced
Research Working Conference on Correct Hardware Design and Verification
Methods. CHARME 2003 continues the series of working conferences devoted to the
development and use of leading-edge formal techniques and tools for the design
and verification of hardware and hardware-like systems.

Previous events in the ‘CHARME’ series were held in Edinburgh (2001),
Bad Herrenalb (1999), Montreal (1997), Frankfurt (1995), Arles (1993) and
Turin (1991). This series of meetings was organized in cooperation with IFIP WG
10.5 and 10.2. Prior meetings, stretching back to the earliest days of formal
hardware verification, were held under various names in Miami (1990), Leuven (1989),
Glasgow (1988), Grenoble (1986), Edinburgh (1985) and Darmstadt (1984). We
now have a well-established convention whereby the European CHARME
conference alternates with its biennial counterpart, the International Conference on
Formal Methods in Computer-Aided Design (FMCAD), which is held in
even-numbered years in the USA.

CHARME 2003 took place from 21 to 24 October 2003 at the Computer
Science Department of the University of L’Aquila, Italy. It was cosponsored by
the IFIP TC10/WG10.5 Working Group on Design and Engineering of Electronic
Systems.

The CHARME 2003 scientific program was comprised of: 

— A morning Tutorial by Daniel Geist aimed at industrial and academic in- 
terchange. 

— Two Invited Lectures by Wolfgang Roesner and Fabio Somenzi. 

— Regular Sessions, featuring 24 papers selected out of 65 submissions, ran- 
ging from foundational contributions to tool presentations. 

— Short Presentations, featuring 8 short contributions accompanied by a 
short presentation. 

The conference, of course, also included informal tool demonstrations, not 
announced in the official program. 

The topics in 2003 represented a change in the traditional conference
repertoire. The motivation for this change was the general feeling that the tools and
methodologies of the last decade have run their course. Specifically, hardware
design today is increasingly specified at a higher level of abstraction, with the
advent of design languages such as SystemC and SystemVerilog. This stems from
a definite crisis in our ability to harness the silicon that
can today be manufactured on a single chip. The distinction between software
and hardware is also blurring, since the architectures of systems-on-chips
(SOCs) do not always determine up front what part of the chip’s functionality
should be implemented in hardware and what part should be implemented in
software as embedded code (firmware).




This situation of large silicon real estate raises many questions, and there
are currently very few answers. It is up to the CHARME community to pioneer
new directions in which the silicon industry should head in order to sustain the
great success it has had in recent times. Our choice was to emphasize modelling
and software in this conference. We hope that these will turn out to be the right
choices, but only time will tell.

We are very grateful to the program committee and to all the referees for 
their assistance in selecting the conference papers. 

Warm recognition is due to Giuseppe Della Penna, Benedetto Intrigila and 
Igor Melatti for taking care of the CHARME 2003 organization. 

Special thanks are due to Giuseppe Della Penna for the CHARME 2003 
Web, flier and poster design, as well as for taking care of too many aspects of 
the CHARME 2003 organization to mention them all. 

IBM Labs in Haifa took care of printing and mailing CHARME 2003 fliers. 
We are grateful to Ms. Tamar Yogev for assisting us in this effort. 

The organizers are very grateful to IBM, INTEL, the University of L’Aquila, 
and Regione Abruzzo, whose sponsorship made a significant contribution to 
financing the event. 

Warm recognition is due to the technical support team: Markus Bajohr at
the University of Dortmund, together with Martin Karusseit of METAFrame
Technologies, provided invaluable assistance to all the people using the online
service during the crucial months preceding the conference.

Finally, we are grateful to Ms. Anna Kramer and to all the Springer LNCS 
editorial team for their first-class support during the preparation of this volume. 
What Is beyond the RTL Horizon for
Microprocessor and System Design? 



Wolfgang Roesner 

IBM Server Group 

11400 Burnet Road Austin, Texas 78758, USA 
wolfgang@us.ibm.com



Abstract. The current state of hardware logic design and verification is
discussed based on the project flow used for IBM’s Power4 and Power5
projects.

The frequency and power requirements for these high-end chips constrain
the logic design to a detailed RT-level in order to control physical effects.
On the other hand, the complexity of the designs, which embrace many
speculative mechanisms to push functional performance to higher levels,
forces an early specification of the microarchitecture with a high-level
model.

A review of how high-level modeling has advanced is based on a discussion of
which mechanisms of abstraction raise the specification above the
RT-level. A critique of specification language design leads to an appeal to
the formal verification community to focus efforts on the front-end of the
high-level design process to help shape modeling languages with formally
defined semantics that avoid the mistakes made in the past with ad-hoc
language designs.
The Charme of Abstract Entities



Fabio Somenzi* 

University of Colorado at Boulder 
Fabio@Colorado.EDU 



Abstract. Abstraction is fundamental in combating the state explosion
problem in model checking. Automatic techniques have been developed
that eliminate presumed irrelevant detail from a model and then refine
the abstraction until it is accurate enough to prove the given property.
This abstraction refinement approach, initially proposed by Kurshan, has
received great impetus from the use of efficient satisfiability solvers in the
check for the existence of error traces in the concrete model. Today it
is widely applied to the verification of both hardware and software. For
complex proofs, the challenge is to keep the abstract model small while
carrying out most of the work on it. We review and contrast several
refinement techniques that have been developed with this objective. These
techniques differ in aspects that range from the choice of decision
procedures for the various tasks, to the recourse to syntactic or semantic
approaches (e.g., “moving fence” vs. predicate abstraction), and to the
analysis of bundles of error traces rather than individual ones.



* Supported in part by SRC contract 2002-TJ-920. 
The PSL/Sugar Specification Language
A Language for all Seasons 



Daniel Geist 

IBM Haifa Research Lab. 

Haifa University, Mount Carmel Haifa, Israel 
geist@il.ibm.com



Abstract. The Accellera EDA standards body has recently approved
PSL, a standard property specification language for use in
assertion-based verification via simulation and formal verification tools. This
language, which is based on the Sugar language from IBM, is now supported
by many EDA vendors. More than 40 individuals representing over 20
companies participated in the efforts to form the PSL standard from its
Sugar basis.

The tutorial comprises 2 parts. In the first part, we describe the ba- 
sic principles of PSL/Sugar, focusing on the ease with which complex 
design behaviors may be described with concise, readable PSL/Sugar 
assertions that crisply capture design intent. We summarize the tempo- 
ral constructs of the language, including parameterized sequences and 
properties, directives, and modeling capabilities. We cover the general 
timing model of PSL/Sugar, which transparently supports both (single- 
or multi-clock) synchronous and asynchronous design, and, time permit- 
ting, we explain how PSL/Sugar has been defined to ensure consistent 
semantics for both simulation and formal verification applications. 

In the second part of the tutorial, we present several applications of 
PSL/Sugar, ranging from simple to advanced assertion-based verification 
solutions. These include use of PSL/Sugar for dynamic assertion checking 
and formal model checking, including support for environment modeling 
and assume/guarantee reasoning. Examples of commercial verification
tools which support the PSL/Sugar language will also be presented.
Participants in the tutorial will have an excellent opportunity to learn
about both the language and its applications directly from the speaker,
Dr. Danny Geist, who heads a research group in the IBM Haifa lab where
Sugar was conceived.
Finding Regularity: Describing and Analysing
Circuits That Are Not Quite Regular 



Mary Sheeran 

Chalmers University of Technology 
ms@cs.chalmers.se



Abstract. We demonstrate some simple but powerful methods that ease 
the problem of describing and generating circuits that exhibit a degree of 
regularity, but are not as beautifully regular as the text-book examples. 
Our motivating example is not a circuit, but a piece of C code that is 
widely used in graphics applications. It is a sequence of compare-and- 
swap operations that computes the median of 25 inputs. We use the 
example to illustrate a set of circuit design methods that aid in the 
writing of sophisticated circuit generators. 



1 Introduction 

In arithmetic and digital signal processing, many algorithms are well understood, 
and result in efficient regular circuits. The functional approach to hardware de- 
sign has proved particularly well-suited to the development of such circuits [3, 
10]. Here, we continue to explore this theme; this paper is not about verification, 
but about design methods - a valid, if under-represented, topic of the Charme 
conference. We emphasise the description of circuits, as we feel that ease of 
describing the intended circuit is a key to design productivity. The methods 
presented here go beyond what can be done in VHDL or C, through the use of 
higher order functions and polymorphism, which are features of many functional 
programming languages. The examples shown use Lava, a hardware design sys- 
tem implemented as an embedded domain specific language in the functional 
programming language Haskell [2]. 

Batcher’s classic odd even merge sorting algorithm illustrates the power and 
elegance of the combinator-based approach to describing complex networks: 

oemerge :: Int -> ([a] -> [a]) -> [a] -> [a]
oemerge 1 s2 = s2
oemerge n s2 = ilv (oemerge (n-1) s2) ->- odds s2

oesort :: Int -> ([a] -> [a]) -> [a] -> [a]
oesort 0 s2 = id    -- the identity function
oesort n s2 = two (oesort (n-1) s2) ->- oemerge n s2

Here, ilv, for interleave, is a combinator that applies the given function to the
odd and even elements of a list of inputs, to produce the odd and even elements
of the output list. So, the function ilv reverse applied to the list [1..8] gives
[7,8,5,6,3,4,1,2].

[Fig. 1. Sorting network, with parts labelled two (oesort 2 s2) and oemerge 3 s2]

reverse is a Haskell function whose type is [a] -> [a].
It takes a list of elements of any type a to a list of elements of the same type.
It is a polymorphic function and works at many types. Similarly, ilv has type
([a] -> [b]) -> [a] -> [b]. It takes a function from list of a to list of b
and returns a function of the same type. In functional programming parlance,
it is a higher order function; it takes a function and returns a function. We use
polymorphic higher order functions like ilv to capture circuit interconnection
patterns. A second such function is two, which applies a function to the first
n elements and to the second n elements of a 2n-length input list, so that,
for instance, two reverse [1..8] is [4,3,2,1,8,7,6,5]. Serial composition
is written ->-, and odds s2 applies s2 to pairs of adjacent elements of the
input, but starting with the second element rather than the first. The function
oesort is parameterised both on an integer and on a two-input, two-output
sorter component, s2. The integer and the s2 parameter determine the size and
type of the resulting network. For instance, oesort 3 intSort2 is a circuit that
sorts lists of integers of length 2^3, built from a component that sorts a 2-list of
integers, intSort2.

intSort2 :: [Signal Int] -> [Signal Int]
intSort2 [x,y] = [imin (x,y), imax (x,y)]

To illustrate the combinators, oesort 3 s2 is shown in figure 1. Values flow 
through the network from left to right, and the vertical lines are 2-sorters. The 
first (or leftmost) value of the input list is input along the top wire. 
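
The combinators described above can be modelled on ordinary Haskell lists, without Lava's Signal types. The following is a minimal sketch under that assumption. The helpers halve, unriffle, riffle, evens and sort2 are names introduced here for illustration, and the bodies of two, ilv, odds and ->- are one plausible reading of the prose descriptions, not the Lava library's own definitions. The sketch assumes even-length inputs.

```haskell
infixl 5 ->-

-- Serial composition: feed the output of f into g.
(->-) :: (a -> b) -> (b -> c) -> a -> c
f ->- g = g . f

-- Split a list into its first and second halves (hypothetical helper).
halve :: [a] -> ([a], [a])
halve xs = splitAt (length xs `div` 2) xs

-- two f applies f separately to each half of the input.
two :: ([a] -> [b]) -> [a] -> [b]
two f xs = f as ++ f bs
  where (as, bs) = halve xs

-- unriffle gathers the 1st, 3rd, 5th, ... elements, followed by the rest.
unriffle :: [a] -> [a]
unriffle xs = [x | (x, True) <- marked] ++ [x | (x, False) <- marked]
  where marked = zip xs (cycle [True, False])

-- riffle interleaves the two halves of a list (inverse of unriffle).
riffle :: [a] -> [a]
riffle xs = concat (zipWith (\a b -> [a, b]) as bs)
  where (as, bs) = halve xs

-- ilv f applies f to the odd and to the even elements of the input.
ilv :: ([a] -> [b]) -> [a] -> [b]
ilv f = unriffle ->- two f ->- riffle

-- evens f applies f to adjacent pairs starting with the first element;
-- odds f does the same, starting with the second element.
evens, odds :: ([a] -> [a]) -> [a] -> [a]
evens f (a : b : cs) = f [a, b] ++ evens f cs
evens _ cs           = cs
odds f (a : as) = a : evens f as
odds _ []       = []

oemerge :: Int -> ([a] -> [a]) -> [a] -> [a]
oemerge 1 s2 = s2
oemerge n s2 = ilv (oemerge (n-1) s2) ->- odds s2

oesort :: Int -> ([a] -> [a]) -> [a] -> [a]
oesort 0 s2 = id
oesort n s2 = two (oesort (n-1) s2) ->- oemerge n s2

-- A 2-sorter on ordinary values (stands in for intSort2).
sort2 :: Ord a => [a] -> [a]
sort2 [x, y] = [min x y, max x y]
sort2 xs     = xs
```

In this list model, ilv reverse [1..8] and two reverse [1..8] reproduce the two example results quoted in the text, and oesort 3 sort2 sorts an 8-element list.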

The oesort pattern can be instantiated with many different comparator 
components, depending on the context in which the sorter is to be placed. The 
same description can be used to give bit-parallel and bit-serial implementations, 
simply by plugging in new comparator components. The object of study is the 
connection pattern from which both combinational and sequential sorters can be 
built. To perform verification, we plug in a 2-sorter on bits (bitSort2) and, using 
the 0-1 principle [7], verify functional correctness by generating and checking a 
propositional formula that states that a fixed-size circuit obeys the required
sorting property. The 0-1 principle states that if a network with n input lines
sorts all 2^n sequences of 0s and 1s into nondecreasing order, it will sort any
arbitrary sequence of n numbers into nondecreasing order. We have studied the
design and analysis of sorting networks in a previous paper [3], and we use the 
same verification methods in this paper. 
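
To make the 0-1 principle concrete, the sketch below checks a small comparator network by brute-force enumeration of all 2^n inputs of 0s and 1s. This is an illustration only: the paper generates a propositional formula and passes it to a SAT-solver, whereas this sketch simply enumerates. The Network representation (a list of wire-index pairs) and all function names are assumptions introduced here.

```haskell
-- A comparator network as a list of (upper, lower) wire indices, 0-based.
-- This representation is an assumption made for the sketch.
type Network = [(Int, Int)]

-- Apply one comparator: the smaller value goes to wire i, the larger to wire j.
cmpExch :: Ord a => [a] -> (Int, Int) -> [a]
cmpExch xs (i, j) = [pick k x | (k, x) <- zip [0 ..] xs]
  where
    lo = min (xs !! i) (xs !! j)
    hi = max (xs !! i) (xs !! j)
    pick k x
      | k == i    = lo
      | k == j    = hi
      | otherwise = x

-- Run all comparators of the network from left to right.
runNet :: Ord a => Network -> [a] -> [a]
runNet net xs = foldl cmpExch xs net

isSorted :: Ord a => [a] -> Bool
isSorted xs = and (zipWith (<=) xs (drop 1 xs))

-- All 2^n sequences of 0s and 1s of length n.
zeroOneInputs :: Int -> [[Int]]
zeroOneInputs n = sequence (replicate n [0, 1])

-- By the 0-1 principle, True here means the network sorts every input.
sortsAllZeroOne :: Int -> Network -> Bool
sortsAllZeroOne n net = all (isSorted . runNet net) (zeroOneInputs n)

-- A 3-input example network.
threeSorter :: Network
threeSorter = [(0, 1), (1, 2), (0, 1)]
```

Enumeration is only practical for small n; that cost is what makes a SAT-based check attractive for larger networks.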

The problem that we want to address here is the fact that not all circuits 
are beautiful. They don’t all have a number of inputs that is a power of two, 
and they don’t all have such an obvious recursive structure. For example, how 
would we describe any 7-sorter that contains the minimal number of comparators 
(which is known to be 16 [7])? More generally, how do we describe circuits that 
are somewhat regular? 

Via a running example, a median circuit, we present a series of ideas for how 
to make more sophisticated circuit descriptions, using polymorphism and higher 
order functions. Shadow values and clever components are aids to writing circuit 
generators. Non Standard Interpretation is an old idea that we (and others) have 
used before. Here, we use ordinary polymorphism and components of different 
types, and do not rely on Haskell’s type classes (although type classes are used 
extensively in the Lava implementation). Finally, we needed to extend our range 
of combinators in order to explore a variety of solutions to the median problem. 
We have deliberately not used the more esoteric parts of Haskell, in the hope of 
making the ideas usable in other contexts. 

The median example was inspired not by a circuit, but by a piece of C-code, 
due to Paeth, which appears in Graphics Gems I, a book of classic graphics
algorithms [11]. It is a sequence of 99 compare-swap operations that arranges an
array of 25 inputs so that the median element is in the middle position, and all
smaller elements are at lower indices (and hence all larger are at larger indices). I
first came across a transliteration of this code in reference [5], where it is claimed
(informally and without justification) that this function cannot be performed in 
fewer than 99 comparison-swaps without further information about the input. 
The application area of such programs (and circuits) is median filtering of digital 
images, in which n by n windows of the image have their middle pixel replaced 
by the median pixel, thus removing white noise. A 5 by 5 kernel (as it is called) 
is often used, so the algorithm is of practical interest. A common approach is to 
actually sort the 25 pixels, using Batcher’s odd even merge sort, but in a more 
general variant that allows the division of the input into two parts of unequal 
length. That would take 138 comparators. 

2 Shadow Values I 

The user of Lava describes circuits by writing circuit generators. For example, 
in the oesort example above, the recursive description is instantiated at a par- 
ticular size, and with a particular type of comparator, in order to produce a 
circuit. When we simulate, say, an 8-sorter on integers, what happens is that 
in the background a representation of the concrete circuit is created, and the 
simulate function walks over that representation: 
simulate (oesort 3 intSort2) [3,2,1,6,5,4,0,7]
[0,1,2,3,4,5,6,7]

Here, the values that flow through the circuit are of type Signal Int and 
are circuit level values (even though they look like integers). The component 
intSort2 sorts two such circuit level integers. However, the 3 that is a param- 
eter to oesort is an ordinary Haskell integer. This is an important distinction, 
at least intuitively, as the Lava user must be able to tell what is a circuit de- 
scription and what is a more general Haskell function. There are circuit level 
values (with Signal types), and there are ordinary Haskell values that are used 
in the generation of circuits. Once we have got to a concrete circuit in the in- 
ternal netlist representation, all the ordinary Haskell values have disappeared. 
But in writing the Haskell code that is to be used to generate such a netlist, we 
can make use of ordinary Haskell values, and can make decisions about how the 
circuit should look, based upon them. A common pattern is to pair a Haskell 
value with each circuit level value. The shadow values can control the shape of 
the resulting circuit. 

The simplest form of shadow value is just a boolean that indicates whether 
the corresponding wire should have any components attached to it. The Haskell 
function tomarked f applies f only to those inputs that are paired with True. 
It simply passes through those inputs that are paired with False. 

Main> tomarked (map (*2)) [(1,True), (3,False), (5,True)]
[(2,True), (3,False), (10,True)]

Here, only the first and third values are doubled. We can use this idea when 
generating circuits. If f is a connection pattern that places instances of the 
component s in a particular way on n inputs, to give n outputs, we might want 
to get a circuit with n - i inputs by deleting the top i wires and all components
attached to them. The resulting circuit will take n - i inputs. We pair each of
those n - i real inputs with True, and then add i dummy inputs paired with
False. Then, we can apply f (tomarked s) to the resulting marked list, secure
in the knowledge that the dummy wires will never be touched. Then, we can drop
the dummy wires, and all the marks, to produce n - i circuit level outputs. This
is what the function cutTop i does. Similarly, cutTopBottom i j cuts i wires 
at the top and j on the bottom. Note that a component that is an argument to 
tomarked must be flexible, in that it may be required to deal with a number of 
arguments that is smaller than usual, because of the presence of inputs marked 
with False. In our sorting example, this means that we need a component that 
is not just a 2-sorter, but that can also deal with one or even zero inputs. The 
function smallSort takes a two-input sorter and makes it flexible in this way. 
We will have reason to extend this function later. 

smallSort s2 [] = [] 

smallSort s2 [a] = [a] 

smallSort s2 [a,b] = s2 [a,b] 
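
The marking scheme just described can be sketched on plain lists as follows. The bodies of tomarked and cutTop are not given in the text, so these are assumed implementations that follow the prose. To keep the sketch short, a simple odd-even transposition pattern (transSort, a name introduced here) stands in for oesort as the connection pattern being cut, and sort2 is a hypothetical 2-sorter on ordinary values.

```haskell
infixl 5 ->-

-- Serial composition, as in the text.
(->-) :: (a -> b) -> (b -> c) -> a -> c
f ->- g = g . f

-- tomarked f applies f to the values marked True and puts the results
-- back in the marked positions; values marked False pass through untouched.
tomarked :: ([a] -> [a]) -> [(a, Bool)] -> [(a, Bool)]
tomarked f xs = refill xs (f [x | (x, True) <- xs])
  where
    refill [] _                        = []
    refill ((_, True) : rest) (y : ys) = (y, True) : refill rest ys
    refill (p : rest) ys               = p : refill rest ys

-- cutTop i pat comp builds pat at full size, with i dummy wires on top,
-- then drops the dummies and the marks. The dummy value is never inspected.
cutTop :: Int
       -> (([(a, Bool)] -> [(a, Bool)]) -> [(a, Bool)] -> [(a, Bool)])
       -> ([a] -> [a]) -> [a] -> [a]
cutTop i pat comp xs = [y | (y, True) <- pat (tomarked comp) marked]
  where
    dummy  = error "dummy wire inspected"
    marked = replicate i (dummy, False) ++ [(x, True) | x <- xs]

smallSort :: ([a] -> [a]) -> [a] -> [a]
smallSort s2 []     = []
smallSort s2 [a]    = [a]
smallSort s2 [a, b] = s2 [a, b]

sort2 :: Ord a => [a] -> [a]
sort2 [x, y] = [min x y, max x y]
sort2 xs     = xs

-- Stand-in connection pattern: n rounds of odd-even transposition.
evens, odds :: ([a] -> [a]) -> [a] -> [a]
evens f (a : b : cs) = f [a, b] ++ evens f cs
evens _ cs           = cs
odds f (a : as) = a : evens f as
odds _ []       = []

transSort :: Int -> ([a] -> [a]) -> [a] -> [a]
transSort n s = foldr (->-) id (take n (cycle [evens s, odds s]))
```

With these definitions, cutTop 1 (transSort 4) (smallSort sort2) is a 3-input sorter obtained from a 4-input pattern, in the same way that the text obtains a 7-sorter from oesort 3.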

For example, we can make a 7-sorter from an 8-input odd even merge sorter
by using cutTop 1 and (oesort 3). The resulting network is shown in figure 2.
It is derived from the network shown in figure 1 by omitting the top wire, and
the three comparators connected to it.

Fig. 2. cutTop 1 (oesort 3) (smallSort s2)

In this instance, the resulting netlist has only 7 inputs and 7 outputs, and 
it no longer looks very regular. All history of how that netlist was generated 
using shadow booleans is forgotten at this stage. The reader might argue that 
one could just use padded inputs and leave the pruning of unnecessary gates 
and wires to the lower level design tools. However, we find this approach more 
convenient and less error prone. We have found that padding makes for unread- 
able circuit descriptions, and can lead to the introduction of bugs. Also, we often 
make designs in which we first develop abstract circuits (say with integers whose 
representation has not yet been chosen flowing on wires). We want to be able 
to prune these circuits at an early stage in the design, before we are ready to 
produce input to lower level design tools. 

Formal verification using a SAT-solver is done in the usual way [3]. (Satzoo
is a SAT-solver developed by Eén here at Chalmers [6]. The function satzoo
creates a file in DIMACS format that is passed to the solver, the output of 
which is then passed back to the Haskell interpreter.) 

sortCheck n cct = 

satzoo (prop_doesSortsize (cct (smallSort bitSort2)) n) 

Main> sortCheck 7 (cutTop 1 (oesort 3)) 

Satzoo: ... (t=0.0) Valid. 

Because we consider only restricted forms of networks, we choose not to prove 
that the networks permute their inputs. Such proofs, if required, can also be 
done using a SAT-solver in Lava. 

3 Non-standard Interpretation 

We have already seen how to verify sorting networks by using a 2-sorter on 
bits and the 0-1 principle. This is an example of non-standard interpretation, 
in which we replace the circuit components with others that are intended to 
gather information about the circuit. We then simulate the circuit with the new
components, and suitable initialising inputs, to perform the required analysis. 

To count the number of comparators in a circuit, we replace each comparator 
by a component that adds one to its left hand input and passes its right hand in- 
put through unchanged. Then, at the end, we sum all of the numbers appearing 
on the output. (This simple method works as long as all of the information- 
carrying wires eventually reach the output, but that is the case for all of our 
networks.) We simulate the resulting circuit on a list of zeros. Note that csize2 
is most definitely a circuit level component, whose inputs and outputs are lists 
of integer signals. It is included as a first step towards the use of such functions 
during circuit generation, rather than, as here, during simulation. A more gen- 
eral count function would be a recursive function over the internal data type 
representing circuits. 

csize2 :: [Signal Int] -> [Signal Int]
csize2 [i,j] = [plus(1,i), j]

count n cct
  = simulate (cct (smallSort csize2) ->- sum) (replicate n 0)

Main> count 7 (cutTop 1 (oesort 3))
16



The 7-sorter has as few comparators as possible. Circuit depth is just as easy 
to calculate. Again, integers flow on the wires, and the depth of the output of a 
comparator is one more than the integer maximum of the inputs. The 7-sorter 
has optimal depth (which is 6) [7]. 

cdepth2 :: [Signal Int] -> [Signal Int]
cdepth2 [x,y] = [m,m]
  where m = plus(1, imax(x,y))

depth n cct
  = simulate (cct (smallSort cdepth2) ->- imaximum) (replicate n 0)
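
The counting and depth interpretations can be tried out on ordinary Haskell integers. The sketch below is a simplification under stated assumptions: plain Int values replace Signal Int, there is no simulate step, and the pattern being analysed is a small 3-input sorter (threeSort, a name introduced here) rather than the cut odd even merge sorter.

```haskell
-- A 3-input sorter pattern built from three 2-input components.
-- The name threeSort is hypothetical; the component s2 is a parameter.
threeSort :: ([a] -> [a]) -> [a] -> [a]
threeSort s2 [x, y, z] = [a, b, c]
  where
    [x1, y1] = s2 [x, y]
    [y2, c]  = s2 [y1, z]
    [a, b]   = s2 [x1, y2]
threeSort _ xs = xs

-- Standard interpretation: an ordinary 2-sorter.
sort2 :: [Int] -> [Int]
sort2 [x, y] = [min x y, max x y]
sort2 xs     = xs

-- Size interpretation: add one to the left input, pass the right through.
csize2 :: [Int] -> [Int]
csize2 [i, j] = [1 + i, j]
csize2 xs     = xs

-- Depth interpretation: both outputs are one deeper than the deepest input.
cdepth2 :: [Int] -> [Int]
cdepth2 [x, y] = [m, m]
  where m = 1 + max x y
cdepth2 xs     = xs

-- Run the pattern with the analysing component on a list of zeros.
count, depth :: Int -> (([Int] -> [Int]) -> [Int] -> [Int]) -> Int
count n cct = sum     (cct csize2  (replicate n 0))
depth n cct = maximum (cct cdepth2 (replicate n 0))
```

The same pattern, threeSort, sorts when given sort2, reports three comparators when given csize2, and reports depth three when given cdepth2; only the plugged-in component changes.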

Cutting 2 wires on the top of an 8-sorter also gives a size-optimal circuit with 
12 comparators. We don’t do so well when the number of inputs to the sorter is 
just above a power of two, rather than just below. The smallest known 9-sorters 
have 25 comparators, but cutting 7 wires from a 16-input odd even merge sort 
gives a 28-comparator sorter. 

Our next step is to generalise the combinators ilv and two to be multi-way 
rather than two-way. This leads us to a generalisation of odd even merge sort, 
and also broadens the range of sorters and other networks that can be described 
easily. 



4 Generalised Combinators 

Recall that two f applies f to each half of a list. Its generalisation, pari i f,
applies f to each ith part of the list, so that, for instance, pari 5 f applies f
to each fifth of the list. The function concat flattens a list of lists back into a
list. The general version of ilv instead chops the list into i-length sublists and
transposes, to give i sublists, before applying map f and then returning the list
to its original order.

pari i f = chopinto i ->- map f ->- concat

ilvl i f = chop i ->- transpose ->- map f ->- transpose ->- concat
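
A runnable list-level sketch of these two definitions follows. The helpers chop and chopinto are not defined in the text; the versions here are assumptions that follow the prose (chop i cuts into i-length sublists, chopinto i cuts into i equal parts), and they expect i to be positive and to divide the length of the input.

```haskell
import Data.List (transpose)

infixl 5 ->-

-- Serial composition, as in the text.
(->-) :: (a -> b) -> (b -> c) -> a -> c
f ->- g = g . f

-- chop i cuts a list into consecutive sublists of length i (assumed helper).
chop :: Int -> [a] -> [[a]]
chop i xs
  | i <= 0 || null xs = []
  | otherwise         = take i xs : chop i (drop i xs)

-- chopinto i cuts a list into i parts of equal length (assumed helper).
chopinto :: Int -> [a] -> [[a]]
chopinto i xs = chop (length xs `div` i) xs

-- pari i f applies f to each of the i consecutive parts of the input.
pari :: Int -> ([a] -> [b]) -> [a] -> [b]
pari i f = chopinto i ->- map f ->- concat

-- ilvl i f applies f to each of the i interleaved sublists of the input.
ilvl :: Int -> ([a] -> [b]) -> [a] -> [b]
ilvl i f = chop i ->- transpose ->- map f ->- transpose ->- concat
```

For i = 2 these agree with the earlier two-way examples: pari 2 reverse [1..8] is [4,3,2,1,8,7,6,5] and ilvl 2 reverse [1..8] is [7,8,5,6,3,4,1,2].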

Armed with these new combinators, we can generalise oesort, provided we
can figure out what odds should become. Well, odds s2 sorts an almost-sorted
list. It is able to sort the list by comparing only adjacent elements, and it
compares only those elements that have not already been compared. For i = 3,
it turns out that the new pattern, which we will call fmerge i, should
compare elements a distance two apart, and then adjacent elements, while refraining
from comparing elements whose relation is already known. In general, fmerge i
should first compare elements a distance i - 1 apart, then i - 2 and so on, down
to 1. The function dist i k ss applies ss to elements a distance k apart, but
avoids comparing elements in each i-length sublist.

fmerge i ss = compose [dist i k ss | k <- reverse [1..(i-1)]]

oemergel i 1 ss = ss
oemergel i n ss = ilvl i (oemergel i (n-1) ss) ->- fmerge i ss

oesorti i 0 ss = id
oesorti i n ss = pari i (oesorti i (n-1) ss) ->- oemergel i n ss

Think of the second parameter to oesorti as the number of dimensions. The
instance oesorti i j sorts a list of length i to the power of j. The i parameter,
the size of each dimension, must be odd, although 2 works as a special case
(and gives Batcher’s odd even merge sort shown earlier). For larger even-length
dimensions, some extra comparators are needed, but we will not pursue this
topic here.

Now, if we are to use this general sorting algorithm for i greater than 2, we 
must be able to make sorting components (for use as the ss parameter) for more 
than two inputs. To do this, we extend the function smallSort that was intro- 
duced earlier. The 3-sorter is made from three comparators, and is completely 
standard. The 4- and 5-sorters are made from oesort (and are optimal in both 
size and depth) . Larger sized sorters are easily included in a similar way, and it 
may then make sense to change the style of the definition to a case analysis on 
the length of the input. 

sortSl s2 [x,y,z] = [a,b,c]
  where
    [x1,y1] = s2 [x,y]
    [y2,c]  = s2 [y1,z]
    [a,b]   = s2 [x1,y2]

smallSort s2 []          = []
smallSort s2 [a]         = [a]
smallSort s2 [a,b]       = s2 [a,b]
smallSort s2 [a,b,c]     = sortSl s2 [a,b,c]
smallSort s2 [a,b,c,d]   = oesort 2 s2 [a,b,c,d]
smallSort s2 [a,b,c,d,e] = cutTop 3 (oesort 3) (smallSort s2) [a,b,c,d,e]



If we restrict oesorti to two dimensions, we get the sorting algorithm pro- 
posed by Kolte et al [8] from Motorola. In that case, the rows and columns of
the i x i grid are first sorted, and then the call of fmerge i sorts all the diagonal
lines, starting with the main diagonals. What we add here is both a much more 
streamlined verification process and the generalisation to more than two dimen- 
sions. The paper by Kolte et al proposes an elaborate scheme for testing the 
proposed sorting network, but the use of a SAT-solver and the 0-1 principle is 
a much easier option. On the other hand, the Motorola paper develops software 
for a complete median filter that gives impressive performance on a particular 
architecture. It would be very interesting to develop an efficient median filter on 
an FPGA and compare its performance with more standard implementations. 
That is future work. 

Using 3 dimensions, for example, we can quickly analyse a 27-sorter (made 
from 3- and 2-sorters) to find that it has depth 20 and size 154. This is one 
comparator smaller (though considerably deeper) than the general two-way odd 
even merge. We will make use of oesorti 3 3 later, when constructing the 
25-median circuit. 

Further discussion of the algorithm oesorti is beyond the scope of this paper. 
We believe that fmerge could be improved for larger dimension sizes, and Van 
Voorhis’ work shows how to deal with even-length dimensions [12]. Independent 
of the example, we are pleased with the simplicity of the generalised combinators. 
They give the user access to a broader range of connection patterns, without the 
need to learn many new combinators. 

Now, we return to the 25-median problem. To solve it, we need to use more 
complicated shadow values than those that we have seen so far. We aim to keep 
only those parts of a sorter that contribute to arranging the outputs of the 
median circuit into an order that satisfies the specification. 



5 Shadow Values II 

We saw in section 3 that we can gather information about an instantiated circuit 
by simulating it using specially designed circuit level components like csize2. 
Here, we use similar ideas, but in the world of shadow values. Shadow values 
have so far been unchanging Boolean values. Now, we make them more dynamic 
and more complicated. 

The idea is to use shadow values to record information about the circuit so
far, allowing decisions to be made about how the rest of the circuit should look.
For the median example, what we want to do is to figure out for each “wire”
in the circuit whether or not it is still in the running to be the median, and so
needs to be processed further. And we want to do this figuring out at circuit
generation time. This is not straightforward, and requires some insights into
the mathematics of sorting. We cannot go into the details here, but the reader
is referred to the work of Van Voorhis to see the kinds of arguments that are
required [12]. Our approach is to rewrite our sorter so that the first steps are
to sort the different dimensions of the input. So, for example, a two-dimensional
sorter will start by sorting the rows and columns, and a three-dimensional sorter
will sort along each of the three axes. This pattern is called a butterfly network. It
is straightforward to rewrite oesorti into a butterfly network of sorters followed
by the rest, which we call bafterl. boesorti 3 2 is shown in figure 3. It is
essentially the same as the optimal 25-comparator 9-sorter due to Floyd [7].

Fig. 3. bflyl 3 2 (smallSort s2) ->- fmerge 3 (smallSort s2)

bflyl i 0 f = id
bflyl i n f = pari i (bflyl i (n-1) f) ->- (iter (n-1) (ilvl i) f)

boemergel i 1 ss = id
boemergel i n ss = ilvl i (boemergel i (n-1) ss) ->- fmerge i ss

bafterl i 1 ss = id
bafterl i n ss = pari i (bafterl i (n-1) ss) ->- boemergel i n ss

boesorti i n ss = bflyl i n ss ->- bafterl i n ss

The reason why we do this is that the sortedness of the different dimensions, 
which is the result of the initial butterfly network, remains unaffected throughout 
the rest of the network. Also, inside the butterfly, sorting each new dimension 
leaves the previously sorted dimensions still sorted. So, after the butterfly, it 
is easy to figure out, for a given wire, how many other wires are greater than 
or smaller than it. We give each wire an address that records what happened 
in the butterfly. So, for example, the address [2,1,2] is given to a wire that has 
“passed through” the top, bottom, and top of three 2-way comparators. After 
the butterfly, this wire is greater than or equal to the following set of wires: 
[[1,1,1], [1,1,2], [2,1,1], [2,1,2]]. Similarly, in the case of 27 inputs, the address [3,1,2] 
is less than or equal to the addresses [[3,3,3], [3,3,2], [3,2,3], [3,2,2], [3,1,3], [3,1,2]], 
after a butterfly of 3-sorters. Such calculations have been implemented in the 
functions under and overI. To calculate the list of addresses greater than a 
given one, one needs to know the size of the dimensions. 
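
The two address examples above can be reproduced with a small stand-alone sketch. It is not the Lava code: the names underAddrs and overAddrs are hypothetical, and it assumes that an address component counts output positions from 1 (smallest) up to the dimension size k, with all dimensions the same size.

```haskell
-- An address holds one output position per butterfly dimension.
type Address = [Int]

-- Hypothetical helper: every address that is componentwise <= the given one,
-- i.e. the wires the given wire is known to be greater than or equal to.
underAddrs :: Address -> [Address]
underAddrs = mapM (\a -> [1 .. a])

-- Hypothetical helper: every address that is componentwise >= the given one.
-- The dimension size k is needed here, which is why it is a parameter.
overAddrs :: Int -> Address -> [Address]
overAddrs k = mapM (\a -> [a .. k])
```

With these definitions, underAddrs [2,1,2] yields the four addresses listed in the text, and overAddrs 3 [3,1,2] yields the six addresses of the 27-input example (in a different order).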

Now, inside bafterI, on each shadow wire, we keep lists of the addresses 
that are over and under it. The shadow component for the 2-sorter manipulates 
and updates these lists, which represent sets of addresses, and so do not contain 
duplicates. The standard function nub removes duplicates from a list. 

combs2 :: [([Address], [Address])] -> [([Address], [Address])] 
combs2 [(l1,g1), (l2,g2)] = [(nub (l1++l2), g1), (l2, nub (g1++g2))] 

So, the wire that “passes through” the lower part of the comparator gets a new 
(over, under) pair containing the union of the two input over lists, but only the 
lower under list. For the upper wire, the situation is dual. Then, the lengths of 
these lists give good information about the status of a wire, and its relation to the 
remaining wires. On the input to the circuit, we provide information about the 
target for each wire. In our case, we place a single (shadow) integer on each wire, 
and the wire should be taken out of the running (in the same way as with the 
simple shadow Booleans that we saw earlier) once it is known to be either greater 
than or less than that number of other wires. The target remains unchanged, 
while the address lists grow longer as one moves through the network. (One 
could choose to use two integers for the target, which could be different for the 
over and under lists, but that is not necessary in the median examples shown 
here.) The new shadow component is combine id combs2 where 

combine f g [(a,x), (b,y)] = [(fa,gx), (fb,gy)] 
  where 
    [fa,fb] = f [a,b] 
    [gx,gy] = g [x,y] 

Each wire has a shadow value of type (Int, ([Address], [Address])), that is 
a pair of an integer and a pair of lists of addresses. 
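
To see what the combined shadow component does to a pair of wires, the following sketch restates combs2 and combine as ordinary Haskell, outside Lava, and applies them to two made-up shadow values. The target 14 and the two-dimensional addresses are illustrative assumptions only, and the error clauses are additions that make the functions total.

```haskell
import Data.List (nub)

type Address = [Int]
type Lists   = ([Address], [Address])

-- Shadow 2-sorter on the address lists, as in the text.
combs2 :: [Lists] -> [Lists]
combs2 [(l1, g1), (l2, g2)] = [(nub (l1 ++ l2), g1), (l2, nub (g1 ++ g2))]
combs2 _ = error "combs2: expects exactly two wires"

-- Apply f to the first components and g to the second components.
combine :: ([a] -> [a]) -> ([b] -> [b]) -> [(a, b)] -> [(a, b)]
combine f g [(a, x), (b, y)] = [(fa, gx), (fb, gy)]
  where
    [fa, fb] = f [a, b]
    [gx, gy] = g [x, y]
combine _ _ _ = error "combine: expects exactly two wires"

-- Two wires fresh out of a (hypothetical) butterfly: each list is a
-- singleton holding the wire's own address, and the target is unchanged.
example :: [(Int, Lists)]
example =
  combine id combs2
    [ (14, ([[1,1]], [[1,1]]))
    , (14, ([[1,2]], [[1,2]])) ]
```

Evaluating example leaves both targets at 14, while the first list of the first wire and the second list of the second wire each grow to hold both addresses.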

A wire is certain not to be the median if the number of distinct addresses 
that are either smaller than or greater than it is large enough. The target is set 
to $1 + \lfloor n/2 \rfloor$, where n is the number of inputs to the median circuit. Just after 
the butterfly, the address lists are all singletons containing the address of the 
wire to which they are attached. The function placeTargetAddressI introduces 
the required initial shadow values. 

To be able to make use of these shadow values, we must generalise tomarked. 
The function onPredicate p f causes f to be applied only to those inputs for 
which the predicate p is true of the shadow value. 

Recall that the version of oesortI with the butterfly in the first columns was 

boesortI i n ss = bflyI i n ss ->- bafterI i n ss 

Following this definition, we define 

medI i j ss = bflyI i j ss ->- 
              placeTargetAddressI i j ->- 
              bafterI i j (onPredicate ok (smallSort comp)) ->- 
              unmark 
  where 
    comp = combine ss (combine id combs2) 
    ok   = not . (notmedianI i) 

We leave the butterfly alone, but transform bafterI so that it performs the 
calculations described above when deciding whether or not to include a comparator. 
The result is promising: 

Main> medCheck 27 (medI 3 3) 
Satzoo: ... (t=0.3) Valid. 
Main> count 27 (medI 3 3) 
114 
Main> count 27 (boesortI 3 3) 
154 

We have a circuit that correctly places the median input in the middle output, 
and all of the smaller values to the left of it in the output list. This property is 
checked by the observer medCheck, whose key function is reallyMedian, which 
checks that a given value is larger than all of the elements of a given list, and 
smaller than all of the elements of another. Logical implication (written ==>) is 
the ordering on bits, and andl is a multi-input and gate. 

reallyMedian a smaller bigger = 
  andl ([s ==> a | s <- smaller] ++ [a ==> b | b <- bigger]) 

Again, we use the 0-1 principle, which applies also in the context of median 
networks; for a proof of this, see [9]. (It should be noted that the $O(\log n)$ depth 
selection networks developed in reference [9] are far from being practical.) 
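
The 0-1 principle makes an exhaustive check cheap to sketch in plain Haskell. The code below is not the Lava observer: it models wires as Bool, replaces the SAT-based check by brute-force enumeration of all Boolean inputs, and uses a hypothetical Network type (a list of comparator positions) together with a textbook 3-sorter as the circuit under test.

```haskell
import Control.Monad (replicateM)

-- Logical implication, the ordering on bits (False <= True).
(==>) :: Bool -> Bool -> Bool
a ==> b = not a || b

-- a is at least every element of smaller and at most every element of bigger.
reallyMedian :: Bool -> [Bool] -> [Bool] -> Bool
reallyMedian a smaller bigger =
  and ([s ==> a | s <- smaller] ++ [a ==> b | b <- bigger])

-- A comparator (i, j) with i < j puts the minimum on wire i, the maximum on j.
type Network = [(Int, Int)]

runNetwork :: Network -> [Bool] -> [Bool]
runNetwork net xs = foldl step xs net
  where
    step ws (i, j) = [newVal k w | (k, w) <- zip [0 ..] ws]
      where
        lo = (ws !! i) && (ws !! j)
        hi = (ws !! i) || (ws !! j)
        newVal k w
          | k == i    = lo
          | k == j    = hi
          | otherwise = w

-- Check the median property on all 2^n Boolean inputs (n odd).
medCheck :: Int -> Network -> Bool
medCheck n net = all ok (replicateM n [False, True])
  where
    mid = n `div` 2
    ok ins =
      let outs = runNetwork net ins
      in reallyMedian (outs !! mid) (take mid outs) (drop (mid + 1) outs)

-- A standard three-comparator 3-sorter, used here as a tiny median circuit.
sorter3 :: Network
sorter3 = [(0, 1), (1, 2), (0, 1)]
```

Here medCheck 3 sorter3 holds, whereas a network with a single comparator, such as [(0, 1)], fails the check. Enumeration is only feasible for small n, which is why the paper hands the same property to a SAT solver.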

We have saved 40 comparators in making a 27-median circuit from a 
27-sorter. And the step to a 25-median circuit is now an easy one. We simply cut 
off the top and bottom wires, along with their attached comparators. Note that for 
making smaller median circuits from larger ones, it is necessary to crop the network 
symmetrically. To illustrate the step from a sorter to a median circuit, Figure 4 
shows a 7-median circuit made from the 9-sorter shown in Figure 3. 

The 7-median circuit is optimal, but, sadly, that for 25 inputs has 102 
comparators. And making the 25-median circuit directly (out of 2, 3, 4 and 5-sorters), 
using medI 5 2, takes 112 comparators, although 12 of them could be pruned 
from the butterfly, which has so far been left untouched. Since we have pored 
over the Paeth code and discovered that it starts with a butterfly of 3-sorters 
that is missing its top and bottom wires, we choose to make a final change to 
the 102-comparator network. We use one last idea: clever components that adapt 
themselves to the context in which they find themselves in the final circuit. 

6 Clever Components 

When a component is applied to inputs that have shadow values, then the 
definition of the component can decide what is to happen by looking at those shadow 
values. We have seen this several times. More interestingly, we can, in the 
definition, look to see what a particular arrangement of the basic components does 
to those shadow values. This is done simply by applying the proposed circuit to 
the inputs (which are mixed concrete and shadow values) and then looking only 
at the resulting shadow values. Then, the decision about what circuit to actually 
apply to the inputs can depend on those computed shadow values. This is a kind 
of “try it and see” approach, used during circuit generation. 

Fig. 4. cutTopBottom 1 1 (medI 3 2) (smallSort s2) 

To make the idea more concrete, let us return to the median circuit. Consider 
the case of a flexible 3-sorter, and our predicate (notmedianI) that indicates 
when an output is now done. If the 3-sorter is applied to 3 inputs (none of 
which is done), then it might be the case that two of the outputs become done. 
In that case, we don’t need to know the order between the two done outputs, 
and we might as well use a 2-comparator min or max circuit of three inputs, as 
appropriate. So, think of a component that applies circuit A to the inputs, has a 
look at the shadow part of the result, and decides whether or not to be circuit B 
or to remain as circuit A, when producing the actual outputs of the component. 

The definition of the clever version of smallSort starts off looking very 
much like that of smallSort, but with the addition of the predicate to the 
parameters of the function. When smallSortV p ca is applied to three inputs, 
[in1, in2, in3], it computes sort3I ca [in1, in2, in3], and names the resulting 
shadow values a1, a2 and a3. By applying the predicate to those values, it 
can decide which of the max3I, min3I or sort3I patterns to actually use. Note 
that this is not just about calculating the cone of influence of the wires that are 
not done. Removing the comparator closest to the output of a 3-sorter can give 
either the maximum or the minimum circuit, but not both. 

smallSortV p ca [in1, in2, in3] 
  = if (p a1) && (p a2) then max3I ca [in1, in2, in3] 
    else if (p a2) && (p a3) then min3I ca [in1, in2, in3] 
    else sort3I ca [in1, in2, in3] 
  where 
    [(_,a1), (_,a2), (_,a3)] = sort3I ca [in1, in2, in3] 



If needed, smallSortV should be extended to longer inputs in a similar manner. 
Our final median circuit generator, medVI, is identical to the previous medI, 
except that smallSort is replaced by smallSortV (notmedianI i). And this 
does the trick! cutTopBottom 1 1 (medVI 3 3) has 98 comparators. We can 
use the same descriptions to generate median circuits of other sizes. 

The modification of the sorter is most definitely a hack, though a rather 
effective one. Using another sorter, hand-crafted for the purpose, we have, in 
fact, been able to get the number of comparators down to 96, but we are unable 
to generalise that sorter to other sizes, and so chose not to present it here. The 
circuit development shown here allowed us to exemplify shadow values and clever 
components. However, for a circuit, as distinct from a C program, one should 
really aim for small depth rather than small number of comparators, so we 
have many more median circuits to explore. It would have been more pleasing 
to develop a recursive median circuit meeting our specification from scratch. 
But even when we have designed specific median circuits, we have found them 
difficult to express. They have an annoying lack of regularity, in that they tend 
to have fewer comparators in each phase as one approaches the outputs, but not 
according to any simple pattern. This is what led us to use clever components. 

There are other ways to make median circuits, for example by looking at the 
inputs bit by bit [4], and indeed they may well be better. We have restricted our 
attention to comparator-based networks so far. 

7 Related Work 

Ideas similar to shadow values and clever components were used in the genera- 
tion of the FM9001 netlist, as part of a large microprocessor verification effort 
[1]. Circuit generators (for example for the ALU) not only used precursors of 
both shadow values and clever components, but were also verified to meet their 
specifications for all sizes. This was done by a deep embedding of the DUAL-EVAL 
netlist language produced by the generators within the Boyer-Moore logic. The 
proofs required that the interpretation of the resulting netlists did indeed work 
correctly for all possible sets of inputs. This work, which is so closely in line with 
our aims, is barely mentioned in the published paper, which concentrates on the 
overall verification goal. Indeed, it is barely mentioned (in the English text) even 
in the very long technical report on the verification effort, so it will be necessary 
to delve into the code. I did not know about this work when writing the first 
version of this paper. My disappointment at discovering that my ideas are not 
as new as I thought has been overshadowed by the realisation that my tentative 
ideas for mixing clever components and verification have already been shown to 
work. 

In the current version of Lava, we perform formal verification only of fixed 
size circuit instances. A first step towards making use of the FM9001 generator 
work would be to generate DUAL-EVAL code and to perform proofs about that 
code. However, we have long considered a move to using first order provers and 
inductive proofs of recursive circuits. Our emphasis on the use of higher order 
functions gives our circuit descriptions (and our use of shadow variables and 
clever components) a different style from those in the FM9001 work. The next 
step will be to find good ways to combine the best of both approaches. 

8 Conclusion 

We have presented a collection of methods that together allow us to describe and 
analyse circuits that are not quite regular. We distinguish circuit generation time 
from circuit analysis time, and there is a clear analogy with compile time and 
run time, and with static and dynamic semantics in VHDL. The aim of circuit 
generation is to produce a representation (in terms of a suitable recursive data 
type) of a complete fixed-size circuit, something very close to a netlist. Circuit 
analysis is what happens when we turn this representation into various other 
notations, in order to scrutinise it further, often with the help of external tools 
such as SAT-solvers and model checkers. Simulation is one such analysis. 

During circuit generation, we use the power of Haskell to control the process 
of generating the required netlist. Special values called shadow values are asso- 
ciated with the circuit level values, and can be used to control the generation 
process. They can be static, like the shadow Booleans that we use when omit- 
ting unwanted parts of networks, or dynamic, like the address lists that we used 
to track progress towards a target in the median circuit example. The shadow 
values can also encode information about the circuit that feeds a component, 
allowing the component itself to decide what circuit would best be introduced 
into the network at that point. These clever components are likely to have many 
applications. For example, the “try it and see” could extend to calling external 
tools like, say, automatic place and route tools with possible circuits that might 
be included in the final design, and then picking the one that gives the best result 
according to some criterion such as timing, testability or power consumption. We 
have not yet incorporated the notion of layout into this work but that will be 
the next step. At Chalmers, we are developing a language that captures wires 
and layout explicitly, but uses Lava functions for describing circuit function. It 
can be seen as a generalisation of layout combinators [3], and we have had to 
move from 2-dimensional tiles to 3-dimensional blocks. We aim to be able to 
capture the ways in which regular circuits become irregular during the design 
process, for example when they are designed to fit under a particular intercon- 
nect fabric. Our intention is to combine the design methods illustrated here with 
circuit analyses that capture wire length and related non-functional properties. 
Thus, it is the problem of how to do interconnect-aware design that is the main 
motivation for this research. 

However, there is a second motivation, the need to push formal verification 
earlier in the design process. We had speculated that clever components would 
allow sub-parts of circuits to be verified during circuit generation, but had not 
yet performed any experiments in this area. The FM9001 work shows, very 
convincingly, that these ideas enable both hierarchical proofs and the generation 
of circuits that are built for verifiability. That the FM9001 proof can simply be 
rerun for any size is an extremely important property of the verification effort. 
The use of verified circuit generators in the FM9001 work goes beyond what we 
had envisaged. We feel spurred on to investigate ways to support inductive proofs 
of recursive circuit generators based on Lava combinators, while still aiming for 
as much proof automation as possible. 

Finally, we would like to investigate whether or not our methods, and in 
particular clever components, could be applied to the description and analysis 
of reconfigurable circuits. 
Abstract. Predicate abstraction is a popular abstraction technique 
employed in formal software verification. A crucial requirement to make 
predicate abstraction effective is to use as few predicates as possible, 
since the abstraction process is in the worst case exponential (in both 
time and memory requirements) in the number of predicates involved. If 
a property can be proven to hold or not hold based on a given finite set of 
predicates $\mathcal{P}$, the procedure we propose in this paper automatically 
finds a minimal subset of $\mathcal{P}$ that is sufficient for the proof. We explain how 
our technique can be used for more efficient verification of C programs. 
Our experiments show that predicate minimization can result in a 
significant reduction of both verification time and memory usage compared 
to earlier methods. 



1 Introduction 

Predicate abstraction [13] is a commonly used abstraction technique in formal 
verification of both software and hardware. Like other abstractions, when suc- 
cessful it can be used to prove the correctness (or incorrectness) of a property 
with only partial information about the reachable states of the system. This 
facilitates the verification of systems larger than would otherwise be possible. 
Predicate abstraction has been used widely both for hardware [5] and software [2, 
9] verification. In this article we focus on its application to the verification of C 
programs. 

Verification of programs typically concentrates on the control flow of the 
program (e.g. checking if a particular control point is reachable), rather than 
on the data manipulated by it (e.g. checking functional correctness). Predicate 
abstraction is a common abstraction technique used in this context. Given a 
program $\Pi$ and a set of predicates $\mathcal{P}$, verification with predicate abstraction 
consists of constructing and analyzing an automaton $A(\Pi, \mathcal{P})$, an abstraction of 
$\Pi$ relative to $\mathcal{P}$. 

We will describe in more detail predicate abstraction for verification of C 
programs in section 2. For now let us just mention that the process of 
constructing $A(\Pi, \mathcal{P})$ is in the worst case exponential, both in time and space, in $|\mathcal{P}|$. 
Therefore a crucial point in deriving efficient algorithms based on predicate 
abstraction is the choice of a small set of predicates. In other words, one of the main 
challenges in making predicate abstraction effective is distinguishing a small set 
of predicates that are sufficient for determining whether a property holds or not. 
In this article we present an automated technique for finding the minimal such 
set from a given set of candidate predicates. 

In the original article describing predicate abstraction [13] the process of 
selecting predicates is done manually. An automatic method for choosing 
predicates was suggested by Ball and Rajamani [2]. They follow a CounterExample 
Guided Abstraction Refinement (CEGAR) loop, which we now describe. Let $\varphi$ 
be the property that we wish to verify over the program $\Pi$. We denote by MC 
a model checking algorithm that takes both $A(\Pi, \mathcal{P})$ and $\varphi$ as inputs and 
outputs TRUE if $A(\Pi, \mathcal{P}) \models \varphi$ and a counterexample $\tau$ otherwise. We assume $\varphi$ is a 
safety property, so that $\tau$ is a finite acyclic trace of $A(\Pi, \mathcal{P})$. Since $\tau$ is a trace 
of $A(\Pi, \mathcal{P})$, it is often called an abstract trace. Let $\gamma$ be a trace concretization 
function that maps every abstract trace to a sequence of instructions of $\Pi$ 
consistent with the control flow graph. In order to check whether this sequence is a 
valid trace of $\Pi$, we define a Trace Checking algorithm TC that takes $\Pi$ and $\tau$ 
as inputs and returns TRUE if $\gamma(\tau)$ is a valid trace of $\Pi$ and FALSE otherwise. In 
the latter case $\tau$ is called a spurious counterexample. Finally, if $\tau$ is spurious, we 
need to eliminate it from the abstract model. We say that a set of predicates $\mathcal{P}'$ 
eliminates $\tau$ iff for every trace $\tau'$ of $A(\Pi, \mathcal{P}')$, $\gamma(\tau) \neq \gamma(\tau')$; i.e., the 
concretizations of all traces in $A(\Pi, \mathcal{P}')$ are different from $\gamma(\tau)$. Given these definitions, we 
now describe the four steps of the CEGAR loop (usually $\mathcal{P} = \emptyset$ initially): 

1. Abstract. Construct $A(\Pi, \mathcal{P})$. 

2. Verify. If $MC(A(\Pi, \mathcal{P}), \varphi)$ = TRUE, return property holds. 
   Otherwise let $\tau$ be the counterexample. 

3. Check. If $TC(\Pi, \tau)$ = TRUE, return property does not hold. 

4. Refine. Update $\mathcal{P}$ so as to eliminate $\tau$. Go to step 1. 
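
The four steps can be sketched as a single higher-order function. This is only an outline under stated assumptions: the abstraction, model checking, trace checking and refinement procedures are passed in as parameters rather than implemented, all names (cegar, Verdict, toy) are hypothetical, and the toy instance at the end is invented purely so that the loop can be run.

```haskell
data Verdict = Holds | Fails deriving (Eq, Show)

-- Generic CEGAR loop; the four arguments correspond to steps 1 to 4.
cegar
  :: (preds -> model)           -- 1. Abstract: build the abstract model
  -> (model -> Maybe trace)     -- 2. Verify: Nothing means the property holds
  -> (trace -> Bool)            -- 3. Check: True means the trace is real
  -> (preds -> trace -> preds)  -- 4. Refine: eliminate a spurious trace
  -> preds                      -- initial predicate set
  -> Verdict
cegar abstract modelCheck traceCheck refine = go
  where
    go preds =
      case modelCheck (abstract preds) of
        Nothing -> Holds
        Just tau
          | traceCheck tau -> Fails
          | otherwise      -> go (refine preds tau)

-- Toy instance (entirely made up): the abstraction is precise enough once
-- predicates 1, 2 and 3 are present; every counterexample is spurious.
toy :: Verdict
toy = cegar id firstMissing (const False) (\ps t -> t : ps) []
  where
    firstMissing ps =
      case [p | p <- [1, 2, 3 :: Int], p `notElem` ps] of
        []      -> Nothing
        (p : _) -> Just p
```

In the toy run the refinement step simply accumulates predicates, which is the behaviour of the earlier approaches; the loop itself does not depend on how refine chooses the new set.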

Step 4 is the crucial one, and also the subject of this article. In previous work [2, 
9] the refinement is done by adding predicates that eliminate the new spurious 
counterexample while maintaining the predicates that were found in previous 
iterations. This guarantees that no spurious counterexample will be repeated. 
However, this accumulative approach cannot guarantee a minimal set of predi- 
cates, because it depends on the order in which the counterexamples are identi- 
fied and the choice of predicates at each step. 

For example, consider a scenario where the first counterexample, $\tau_1$, can be 
eliminated by either $p_1$ or $p_2$, and the process chooses $p_1$. Now it finds another 
counterexample, $\tau_2$, which can only be eliminated by the predicate $p_2$. The 
process now proceeds with both $p_1$ and $p_2$, although $p_2$ by itself is sufficient to 
eliminate both $\tau_1$ and $\tau_2$. The framework that we present in this article, on the 
other hand, finds a minimal set of predicates that eliminate all the spurious 
counterexamples discovered so far. This guarantees a minimal set of predicates 
throughout the process, which is expected to reduce the overall verification time 
and required space. Our experimental results show that indeed the number of 
predicates and consequently the amount of memory required are significantly 
reduced. 
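
The scenario with $\tau_1$ and $\tau_2$ can be replayed with a brute-force sketch. It is not the procedure of section 3: it assumes, as a simplification, that each spurious counterexample comes with a list of candidate predicates any one of which eliminates it, and it searches all subsets in order of size, which is exponential in the number of candidates. The name minimalEliminators is hypothetical.

```haskell
import Data.List (sortOn, subsequences)

-- For each counterexample, the predicates that can eliminate it on their own.
-- Returns a smallest subset of the candidates that eliminates all of them.
minimalEliminators :: Eq p => [p] -> [[p]] -> Maybe [p]
minimalEliminators candidates cexs =
  case filter coversAll (sortOn length (subsequences candidates)) of
    []      -> Nothing
    (s : _) -> Just s
  where
    -- A subset works if it contains an eliminator for every counterexample.
    coversAll s = all (any (`elem` s)) cexs
```

For the example in the text, minimalEliminators ["p1", "p2"] [["p1", "p2"], ["p2"]] returns Just ["p2"], whereas an accumulating refinement that first picked p1 would end up with both predicates.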



Related Work. Predicate abstraction was introduced by Graf and Saidi in [13]. 
It was subsequently used with considerable success in both hardware and 
software verification [2,8,9]. The notion of CEGAR was originally introduced by 
Kurshan [10] (originally termed localization) for model checking of finite state 
models. Both the abstraction and refinement techniques for such systems, as 
applied in his and subsequent works, are essentially different from the predicate 
abstraction approach we follow. For example, abstraction in localization 
reduction is done by assigning non-deterministic values to selected sets of variables, 
while refinement corresponds to gradually returning to the original definition of 
these variables. More recently the CEGAR framework has also been successfully 
adapted for verifying infinite state systems [12], and in particular software [3,9]. 
The problem of finding small (yet not minimal) sets of predicates is also being 
investigated in the context of hardware designs in [5]. 

The rest of this article is structured as follows. In the next section we discuss 
in more detail the CEGAR loop for predicate abstraction and how it is used 
for verifying C programs. In section 3 we describe in detail the procedure for 
selecting a minimal set of predicates. In section 4 we present the results of 
applying our technique to several realistic examples and detail our conclusions. 



2 Predicate Abstraction/Refinement for C Programs 

In the introduction we discussed the overall structure of a CEGAR loop. In this 
section we explain how this framework can be applied for verifying C programs. 
We do so by describing how the various basic blocks of the CEGAR loop are 
implemented. In particular, we discuss the construction of $A(\Pi, \mathcal{P})$ in section 
2.1, the notion of trace concretization ($\gamma$) in section 2.2, the trace checking 
algorithm TC in section 2.3, and a method for checking whether a set of predicates 
eliminates a spurious counterexample in section 2.4. 

2.1 Constructing the Abstract Model 

We begin with the process of constructing $A(\Pi, \mathcal{P})$ given a C program $\Pi$ and 
an initial set of predicates $\mathcal{P}$. For the sake of simplicity, we assume that $\Pi$ 
consists of a single monolithic C main procedure obtained via inlining (we disallow 
function pointers and recursion in order to make inlining effective). Without loss 
of generality, we can assume that there are only four kinds of statements in $\Pi$: 
assignments, if-then-else branches, goto and return. We denote by Stmt the 
set of statements of $\Pi$ and by Exp the set of all pure (side-effect free) C 
expressions over the variables of $\Pi$. As a running example we use the following simple 
C program and the property that label L4 is unreachable. 

int x, y; 
L0: x = 1; 
L1: y = 1; 
L2: if (x == y) 
L3:     y = 1; 
L4: else y = 2; 

Initial Abstraction with Control Flow Automata. The construction of 
$A(\Pi, \mathcal{P})$ begins with the construction of the control flow automaton (CFA) of 
$\Pi$. The states of a CFA correspond to control points in the program. The 
transitions between states in the CFA correspond to possible transitions between their 
associated control points in the program, assuming that every branch in the 
program can be taken. Thus, a CFA of a program is a conservative abstraction of 
the program’s control flow, i.e. it allows a superset of the possible traces of the 
program. 

Formally the CFA is a 4-tuple $(S_{CF}, I_{CF}, T_{CF}, \mathcal{L})$ where: 

- $S_{CF}$ is a set of states. 

- $I_{CF} \in S_{CF}$ is an initial state. 

- $T_{CF} \subseteq S_{CF} \times S_{CF}$ is a set of transitions. 

- $\mathcal{L} : S_{CF} \setminus \{\mathit{final}\} \rightarrow \mathit{Stmt}$ is a labeling function. 

$S_{CF}$ contains a distinguished final state which does not belong to the domain 
of $\mathcal{L}$. The transitions between states reflect the flow of control between their 
labeling statements: $\mathcal{L}(I_{CF})$ is the initial statement of $\Pi$ and $(s_1, s_2) \in T_{CF}$ iff 
one of the following conditions holds: 

- $\mathcal{L}(s_1)$ is an assignment or goto with $\mathcal{L}(s_2)$ as its unique successor. 

- $\mathcal{L}(s_1)$ is a branch with $\mathcal{L}(s_2)$ as its then or else successor. 

- $\mathcal{L}(s_1)$ is a return statement and $s_2 = \mathit{final}$. 

The CFA is equivalent, as we will shortly see, to 

Example 1. The CFA of our example program is shown in Figure 1(a), where 
every state s is labeled with $\mathcal{L}(s)$. Henceforth we will refer to each CFA state by 
the corresponding statement label. We will use final for the final state. Therefore 
the states of the CFA in Figure 1(a) are L0 ... L4 and final, with L0 being the 
initial state. □ 
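
As a concrete reading of the 4-tuple, the CFA of the running example might be written down as follows. This is a sketch with hypothetical type and field names, and it rests on one assumption that Figure 1(a) would settle: since the program text has no explicit return, the statements at L3 and L4 are taken to flow directly to the final state.

```haskell
data State = L0 | L1 | L2 | L3 | L4 | Final
  deriving (Eq, Show)

data Stmt
  = Assign String String   -- variable and right-hand side, as text
  | Branch String          -- branch condition, as text
  deriving (Eq, Show)

data CFA = CFA
  { initial     :: State
  , transitions :: [(State, State)]
  , label       :: State -> Maybe Stmt   -- Nothing only for the final state
  }

exampleCFA :: CFA
exampleCFA = CFA
  { initial     = L0
  , transitions = [ (L0, L1), (L1, L2), (L2, L3), (L2, L4)
                  , (L3, Final), (L4, Final) ]   -- last two are assumed
  , label       = lab
  }
  where
    lab L0    = Just (Assign "x" "1")
    lab L1    = Just (Assign "y" "1")
    lab L2    = Just (Branch "x == y")
    lab L3    = Just (Assign "y" "1")
    lab L4    = Just (Assign "y" "2")
    lab Final = Nothing

successors :: CFA -> State -> [State]
successors cfa s = [t | (s', t) <- transitions cfa, s' == s]

-- States reachable from the initial state, in breadth-first order.
reachable :: CFA -> [State]
reachable cfa = go [initial cfa] []
  where
    go [] seen = reverse seen
    go (s : rest) seen
      | s `elem` seen = go rest seen
      | otherwise     = go (rest ++ successors cfa s) (s : seen)
```

Because every branch is assumed possible, L4 is reachable in this CFA even though it is unreachable in the program; ruling it out is the job of the predicates added later.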



Predicate Inference. The main challenge in predicate abstraction is to 
identify the predicates that are necessary for proving the given property. In our 
framework we require $\mathcal{P}$ to be a subset of the branch statements in $\Pi$. Therefore 
we sometimes refer to $\mathcal{P}$ or subsets of $\mathcal{P}$ simply as a set of branches, where the 
actual meaning is the predicates that serve as the guards in these branches. The 
construction of $A(\Pi, \mathcal{P})$ associates with each state s of the CFA a finite subset of 
Exp derived from $\mathcal{P}$, denoted by $\mathcal{P}_s$. The process of constructing the $\mathcal{P}_s$’s from 
$\mathcal{P}$ is known as predicate inference and is described by the algorithm Predinfer 
in Figure 2. Note that $\mathcal{P}_s$ is always $\emptyset$ if s is either the final state or $\mathcal{L}(s)$ is a 
return statement. 

Fig. 1. (a) The CFA for our example program, (b) The CFA labeled with inferred 
predicates if $\mathcal{P} = \{(x == y)\}$, i.e., it contains the only branch in the program, and (c) 
The abstract automaton $A(\Pi, \mathcal{P})$, which proves that L4 is not reachable. 

The algorithm uses a procedure for computing the weakest precondition $\mathcal{WP}$ 
of a predicate p relative to a given statement. We define $\mathcal{WP}$ in the same way as 
Ball and Rajamani [2]. First, consider a C assignment statement a of the form 
v = e;. Let $\varphi$ be a pure C expression ($\varphi \in \mathit{Exp}$). Then the weakest precondition of 
$\varphi$ with respect to a, denoted by $\mathcal{WP}(\varphi, a)$, is obtained from $\varphi$ by replacing every 
occurrence of v in $\varphi$ with e. A second case considers a C assignment statement 
a in which e is assigned to a variable whose address is stored in v, i.e. a is of 
the form *v = e;. Let $\{v_1, \ldots, v_n\}$ be the set of variables appearing in $\varphi$ and 
for $1 \le i \le n$ let $a_i$ be the assignment statement $v_i$ = e;. $\mathcal{WP}(\varphi, a)$ is then: 

$$\Big(\bigvee_{i=1}^{n} \big((v == \&v_i) \;\&\&\; \mathcal{WP}(\varphi, a_i)\big)\Big) \;||\; \Big(\Big(\bigwedge_{i=1}^{n} !(v == \&v_i)\Big) \;\&\&\; \varphi\Big)$$

The weakest precondition is clearly an element of Exp as well. The purpose 
of predicate inference is to create $\mathcal{P}_s$’s that lead to a very precise abstraction of 
the program relative to the predicates in $\mathcal{P}$. Intuitively, this is how it works. Let 
$s, t \in S_{CF}$ such that $\mathcal{L}(s)$ is an assignment statement and $(s, t) \in T_{CF}$. Suppose 
a predicate $p_t$ gets inserted in $\mathcal{P}_t$ at some point during the execution of Predinfer 
and suppose $p_s = \mathcal{WP}(p_t, \mathcal{L}(s))$. Now consider any execution state of $\Pi$ where 
the control has reached $\mathcal{L}(t)$ after the execution of $\mathcal{L}(s)$. It is obvious that $p_t$ 
will be true in this state iff $p_s$ was true before the execution of $\mathcal{L}(s)$. In terms 
of the CFA, this means that the value of $p_t$ after a transition from s to t can be 
determined precisely on the basis of the value of $p_s$ before the transition. This 
motivates the inclusion of $p_s$ in $\mathcal{P}_s$. The cases in which $\mathcal{L}(s)$ is not an assignment 
statement can be explained analogously. 
Input: Set of branch statements $\mathcal{P}$ 
Output: Set of $\mathcal{P}_s$’s associated with each CFA state 
Initialize: $\forall s \in S_{CF}$, $\mathcal{P}_s := \emptyset$ 
Forever do 
    For each $s \in S_{CF}$ do 
        If $\mathcal{L}(s)$ is an assignment statement and $\mathcal{L}(s')$ is its successor 
            For each $p' \in \mathcal{P}_{s'}$ add $\mathcal{WP}(p', \mathcal{L}(s))$ to $\mathcal{P}_s$ 
        Else if $\mathcal{L}(s)$ is a branch statement with condition c 
            If $\mathcal{L}(s) \in \mathcal{P}$ add c to $\mathcal{P}_s$ 
            If $\mathcal{L}(s')$ is a ‘then’ or ‘else’ successor of $\mathcal{L}(s)$, $\mathcal{P}_s := \mathcal{P}_s \cup \mathcal{P}_{s'}$ 
        Else if $\mathcal{L}(s)$ is a ‘goto’ statement with successor $\mathcal{L}(s')$, $\mathcal{P}_s := \mathcal{P}_s \cup \mathcal{P}_{s'}$ 
    If no $\mathcal{P}_s$ was modified in the ‘for’ loop, exit 

Fig. 2. Algorithm Predinfer for predicate inference. 



Note that Predinfer may not terminate in the presence of loops in the CFA. 
However, this does not mean that our approach is incapable of handling C 
programs containing loops. In practice, we force termination of Predinfer by limiting 
the maximum size of any $\mathcal{P}_s$. Using the resulting $\mathcal{P}_s$’s, we can compute the states 
and transitions of the abstract model as described in the next section. 
Irrespective of whether Predinfer was terminated forcefully or not, the resulting model 
is guaranteed to be a sound abstraction of $\Pi$. We have found this approach to 
be very effective in practice. A similar algorithm was proposed by Dams and 
Namjoshi [7]. 

Example 2. Consider the CFA described in Example 1. Suppose $\mathcal{P}$ contains the 
only branch (L2) in our example program. Then Predinfer begins with $\mathcal{P}_{L2} = 
\{(x == y)\}$. From this it obtains $\mathcal{P}_{L1} = \{\mathcal{WP}((x == y), y = 1;)\} = \{(x == 1)\}$ 
and then $\mathcal{P}_{L0} = \{\mathcal{WP}((x == 1), x = 1;)\} = \{(1 == 1)\}$. As (1 == 1) is trivially 
true, we do not include it in $\mathcal{P}_{L0}$. Thus $\mathcal{P}_{L0} = \emptyset$. Finally $\mathcal{P}_{L3} = \mathcal{P}_{L4} = \mathcal{P}_{final} = \emptyset$. 
Figure 1(b) shows the CFA with each state s labeled on the outside by $\mathcal{P}_s$. □ 
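
The two weakest-precondition computations of Example 2 amount to syntactic substitution, which the following sketch makes explicit. It covers only the simple assignment case v = e; over a deliberately tiny, hypothetical expression type (variables, integer literals and equality); the pointer case *v = e; and the rest of C are left out.

```haskell
-- A tiny expression language, just large enough for Example 2.
data Expr
  = Var String
  | Lit Int
  | Equals Expr Expr
  deriving (Eq, Show)

-- Replace every occurrence of variable v in an expression by e.
subst :: String -> Expr -> Expr -> Expr
subst v e (Var w)
  | w == v    = e
  | otherwise = Var w
subst _ _ (Lit n)      = Lit n
subst v e (Equals a b) = Equals (subst v e a) (subst v e b)

-- Weakest precondition of phi with respect to the assignment "v = e;".
wpAssign :: Expr -> String -> Expr -> Expr
wpAssign phi v e = subst v e phi

-- Predicates such as (1 == 1) are dropped during inference.
triviallyTrue :: Expr -> Bool
triviallyTrue (Equals (Lit a) (Lit b)) = a == b
triviallyTrue _                        = False
```

Applying wpAssign to (x == y) and the assignment y = 1 gives (x == 1); applying it again with x = 1 gives (1 == 1), which triviallyTrue recognises, matching the empty predicate set at L0.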



The States and Transitions of the Abstract Model. So far we have 
described a method for computing the initial abstraction (the CFA) and a set of 
predicates associated with each location in the program. The states of the 
abstract system $A(\Pi, \mathcal{P})$ correspond to the various possible valuations of the 
predicates in each location (this is the reason why the abstract graph is exponential in 
the number of predicates). Formally, for a CFA node s suppose $\mathcal{P}_s = \{p_1, \ldots, p_k\}$. 
Then a valuation of $\mathcal{P}_s$ is a boolean vector $v_1, \ldots, v_k$. Let $V_s$ be the set of all 
predicate valuations of $\mathcal{P}_s$. Then the predicate concretization function $\Gamma_s : V_s \rightarrow \mathit{Exp}$ 
is defined as follows. Given a valuation $V = \{v_1, \ldots, v_k\} \in V_s$, $\Gamma_s(V) = \bigwedge_{i=1}^{k} p_i^{v_i}$ 
where $p_i^{\mathrm{true}} = p_i$ and $p_i^{\mathrm{false}} = \neg p_i$. As a special case, if $\mathcal{P}_s = \emptyset$, then 
$V_s = \{\bot\}$ and $\Gamma_s(\bot) = \mathit{true}$. 

Example 3. Suppose $\mathcal{P}_s = \{(a == 0), (b > 5), (c < d)\}$, $V_1 = \{0, 1, 1\}$ and 
$V_2 = \{1, 0, 1\}$. Then $\Gamma_s(V_1) = (\neg(a == 0)) \wedge (b > 5) \wedge (c < d)$ and 
$\Gamma_s(V_2) = (a == 0) \wedge (\neg(b > 5)) \wedge (c < d)$. □ 
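
The concretization function is easy to mirror in code. The sketch below uses a hypothetical Formula type with predicates kept as opaque strings, and writes valuations as lists of Bool (True for 1, False for 0); the empty predicate set is mapped to the constant true, as in the special case above.

```haskell
data Formula
  = Atom String        -- a predicate, kept as text
  | Not Formula
  | And [Formula]
  | TrueF              -- the formula "true"
  deriving (Eq, Show)

-- Concretize a valuation of the given predicates into a formula.
concretize :: [String] -> [Bool] -> Formula
concretize [] [] = TrueF
concretize ps vs
  | length ps == length vs = And (zipWith lit ps vs)
  | otherwise = error "concretize: one truth value per predicate expected"
  where
    lit p True  = Atom p
    lit p False = Not (Atom p)
```

With the predicates of Example 3, concretize ["a == 0", "b > 5", "c < d"] [False, True, True] negates only the first predicate. Since a state with k predicates has $2^k$ valuations, enumerating them all is what makes the abstract model exponential in the number of predicates.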
Computing the transitions between the states in $\mathcal{A}(\Pi, \mathcal{P})$ requires a theorem
prover. We add a transition between two abstract states unless we can prove that
there is no transition between their corresponding concrete states. If we cannot
prove this, we say that the two states (or the two formulas representing them)
are admissible. This problem can be reduced to the problem of deciding whether
$\neg(\psi_1 \wedge \psi_2)$ is valid, where $\psi_1$ and $\psi_2$ are arbitrary quantifier free first order
logic formulas. In general this problem is known to be undecidable. However,
for our purposes it is sufficient that the theorem prover be sound and always
terminate. Several publicly available theorem provers (such as Simplify [11])
have this characteristic.

Given arbitrary formulas $\psi_1$ and $\psi_2$, we say that the formulas are admissible
if the theorem prover returns false or unknown on $\neg(\psi_1 \wedge \psi_2)$. We denote
this by $Adm(\psi_1, \psi_2)$. Otherwise the formulas are inadmissible, denoted by
$\neg Adm(\psi_1, \psi_2)$.
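The three-valued treatment of the prover's answer can be sketched as follows. This is a toy stand-in, not the interface of Simplify or of any real prover: formulas are limited to the constants `True` and `False` or an opaque string, and the names `toy_prover` and `adm` are assumptions made for this illustration.

```python
def toy_prover(psi1, psi2):
    """Toy sound prover for the query not(psi1 and psi2).

    Returns "valid" only when the query is certainly valid, "invalid" when
    it certainly is not, and "unknown" for anything it cannot decide.
    """
    if psi1 is False or psi2 is False:
        return "valid"      # a false conjunct makes the conjunction false
    if psi1 is True and psi2 is True:
        return "invalid"    # true and true is satisfiable
    return "unknown"        # opaque formulas are never decided here


def adm(psi1, psi2, prover=toy_prover):
    """Admissible unless the prover proves not(psi1 and psi2)."""
    return prover(psi1, psi2) != "valid"


print(adm(True, True))        # both formulas are trivially true
print(adm(True, False))       # the second formula is trivially false
print(adm(True, "(x == y)"))  # undecided, so conservatively admissible
```

Treating "unknown" as admissible keeps the abstraction conservative: a transition is dropped only when its absence has actually been proved.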



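A Procedure for Constructing $\mathcal{A}(\Pi, \mathcal{P})$. We now define $\mathcal{A}(\Pi, \mathcal{P})$. Formally,
it is a triple $\langle S_{\mathcal{A}}, I_{\mathcal{A}}, T_{\mathcal{A}} \rangle$ where:

- $S_{\mathcal{A}} = \bigcup_{s \in S_{CF}} \{s\} \times \mathcal{V}_s$ is the set of states.

- $I_{\mathcal{A}} = \{I_{CF}\} \times \mathcal{V}_{I_{CF}}$ is the initial set of states.

- $T_{\mathcal{A}} \subseteq S_{\mathcal{A}} \times S_{\mathcal{A}}$ is the transition relation, defined as follows:
  $((s_1, V_1), (s_2, V_2)) \in T_{\mathcal{A}}$ iff $(s_1, s_2) \in T_{CF}$ and one of the following
  conditions holds:

  1. $\mathcal{L}(s_1)$ is an assignment statement and
     $Adm(\Gamma_{s_1}(V_1), \mathcal{WP}(\Gamma_{s_2}(V_2), \mathcal{L}(s_1)))$.

  2. $\mathcal{L}(s_1)$ is a branch statement with a branch condition $c$, $\mathcal{L}(s_2)$ is its then
     successor, $Adm(\Gamma_{s_1}(V_1), \Gamma_{s_2}(V_2))$ and $Adm(\Gamma_{s_1}(V_1), c)$.

  3. $\mathcal{L}(s_1)$ is a branch statement with a branch condition $c$, $\mathcal{L}(s_2)$ is its else
     successor, $Adm(\Gamma_{s_1}(V_1), \Gamma_{s_2}(V_2))$ and $Adm(\Gamma_{s_1}(V_1), \neg c)$.

  4. $\mathcal{L}(s_1)$ is a goto statement and $Adm(\Gamma_{s_1}(V_1), \Gamma_{s_2}(V_2))$.

  5. $\mathcal{L}(s_1)$ is a return statement and $s_2$ is the final state.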

Example 4. Recall the CFA from Example 1 and the predicates corresponding to
CFA nodes discussed in Example 2. The $\mathcal{A}(\Pi, \mathcal{P})$ obtained in this case appears
in Figure 1(c). Let us see why there is a transition from $(L0, \bot)$ to $(L1, TRUE)$.
Since $\mathcal{L}(L0)$ is an assignment statement, by rule 1 above we compute the following
expressions:

- $\Gamma_{L0}(\bot) = true$

- $\Gamma_{L1}(TRUE) = (x == 1)$

- $\mathcal{L}(L0) = (x = 1;)$

- $\mathcal{WP}(\Gamma_{L1}(TRUE), \mathcal{L}(L0)) = \mathcal{WP}((x == 1), x = 1;) = (1 == 1) = true$

- $Adm(TRUE, TRUE)$.

Thus, we add a transition from $(L0, \bot)$ to $(L1, TRUE)$. Examining a possible
transition from $(L0, \bot)$ to $(L1, FALSE)$, we similarly compute
$\Gamma_{L1}(FALSE) = (\neg(x == 1))$ and $\mathcal{WP}((\neg(x == 1)), x = 1;) = (\neg(1 == 1))$.
Since $\neg Adm(TRUE, (\neg(1 == 1)))$, there is no transition between these two abstract states.
The presence or absence of other transitions can be explained in a similar manner. As no state
labeled by L4 is reachable, we have proven that our example property holds. □

Input: A trace $\tau$ of $\mathcal{A}(\Pi, \mathcal{P})$ s.t. $\gamma(\tau) = \langle s_1, \ldots, s_n \rangle$
Output: TRUE iff $\tau$ is valid (can be simulated on the concrete system)
Variable: $X$ of type formula
Initialize: $X$ := TRUE
For $i = n$ to 1
    If $s_i$ is an assignment
        $X := \mathcal{WP}(X, s_i)$
    Else If $s_i$ is a branch with condition $c$
        If ($i < n$)
            If $s_{i+1}$ is the 'then' successor of $s_i$, $X := X \wedge c$
            else $X := X \wedge \neg c$
    If ($X \equiv false$) return FALSE
Return TRUE

Fig. 3. Algorithm TC to check the validity of a trace of $\Pi$.

Clearly, if we do not limit the size of $\mathcal{P}_s$, $|S_{\mathcal{A}}|$ is exponential in $|\mathcal{P}|$. Hence so are
the worst case space and time complexities of constructing $\mathcal{A}(\Pi, \mathcal{P})$.

2.2 Trace Concretization 

A trace of $\mathcal{A}(\Pi, \mathcal{P})$ is a finite sequence $\langle (s_1, V_1), \ldots, (s_n, V_n) \rangle$ such that (i)
for $1 \le i \le n$, $(s_i, V_i) \in S_{\mathcal{A}}$, (ii) $(s_1, V_1) \in I_{\mathcal{A}}$ and (iii) for $1 \le i < n$,
$((s_i, V_i), (s_{i+1}, V_{i+1})) \in T_{\mathcal{A}}$. Given such a trace $\tau = \langle (s_1, V_1), \ldots, (s_n, V_n) \rangle$ of
$\mathcal{A}(\Pi, \mathcal{P})$, the concretization of $\tau$ is defined as $\gamma(\tau) = \langle \mathcal{L}(s_1), \ldots, \mathcal{L}(s_n) \rangle$. Thus,
the concretization of an abstract trace is a trace of $\Pi$: a sequence of statements
that correspond to some trace in the control flow graph of $\Pi$.

2.3 Trace Checking 

The TC algorithm, described in Figure 3, takes $\Pi$ and a counterexample $\tau$ as
inputs and returns TRUE if $\gamma(\tau)$ is a valid trace of $\Pi$. This is a backward traversal
based algorithm. There is an equivalent algorithm [3] that is forward traversal
based and uses strongest postconditions instead of weakest preconditions.

2.4 Checking Trace Elimination 

Given a spurious counterexample $\tau = \langle (s_1, V_1), \ldots, (s_n, V_n) \rangle$ and a set of
branches $\mathcal{P}$, we will need to determine if $\mathcal{P}$ eliminates $\tau$. To do so we: (i) construct
$\mathcal{A}(\Pi, \mathcal{P})$ and (ii) determine if there exists a trace $\tau'$ of $\mathcal{A}(\Pi, \mathcal{P})$ such that
$\gamma(\tau) = \gamma(\tau')$. The algorithm, called TraceEliminate, is described in Figure 4.¹

¹ Note that in practice this step can be carried out in an on-the-fly manner without
constructing the full $\mathcal{A}(\Pi, \mathcal{P})$.
Input: Spurious trace $\tau$ s.t. $\gamma(\tau) = \langle s_1, \ldots, s_n \rangle$ and a set of predicates $\mathcal{P}$
Output: TRUE if $\tau$ is eliminated by $\mathcal{P}$ and FALSE otherwise
Compute $\mathcal{A}(\Pi, \mathcal{P}) = \langle S_{\mathcal{A}}, I_{\mathcal{A}}, T_{\mathcal{A}} \rangle$
Variable: $X, Y$ of type subset of $S_{\mathcal{A}}$
Initialize: $X := \{(s, V) \in I_{\mathcal{A}} \mid s = s_1\}$
If ($X = \emptyset$) return TRUE
For $i = 2$ to $n$ do
    $Y := \{(s', V') \in S_{\mathcal{A}} \mid (s' = s_i) \wedge \exists (s, V) \in X \,.\, ((s, V), (s', V')) \in T_{\mathcal{A}}\}$
    If ($Y = \emptyset$) return TRUE
    $X := Y$
Return FALSE

Fig. 4. Algorithm TraceEliminate to check if a spurious trace can be eliminated.
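The frontier computation of TraceEliminate can be sketched in Python over an explicitly enumerated abstract model. This is a sketch under stated assumptions: abstract states are `(node, valuation)` tuples, the transition relation is a set of state pairs whose endpoints are assumed to lie in the state set, and the toy model at the bottom is invented for illustration rather than taken from the paper's figures.

```python
def trace_eliminate(trace_nodes, initial_states, transitions):
    """Return True if no abstract trace follows trace_nodes (trace eliminated).

    trace_nodes:    sequence of CFA nodes s_1 .. s_n
    initial_states: set of abstract states (node, valuation)
    transitions:    set of pairs (abstract_state, abstract_state)
    """
    if not trace_nodes:
        raise ValueError("trace must be non-empty")
    # X: initial abstract states sitting at the first node of the trace.
    frontier = {st for st in initial_states if st[0] == trace_nodes[0]}
    if not frontier:
        return True
    for node in trace_nodes[1:]:
        # Y: successors of the frontier that sit at the next node of the trace.
        frontier = {dst for (src, dst) in transitions
                    if src in frontier and dst[0] == node}
        if not frontier:
            return True
    return False


# Hypothetical abstract model: a single chain L0 -> L1 -> L2 -> L3.
init = {("L0", ())}
trans = {
    (("L0", ()), ("L1", (True,))),
    (("L1", (True,)), ("L2", (True,))),
    (("L2", (True,)), ("L3", ())),
}
eliminated = trace_eliminate(["L0", "L1", "L2", "L4"], init, trans)
kept = trace_eliminate(["L0", "L1", "L2", "L3"], init, trans)
print(eliminated, kept)
```

Only the current frontier is stored, which is also what makes an on-the-fly variant possible: successors could be generated on demand instead of being read from a precomputed relation.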



3 Predicate Minimization 

We now present the algorithm for discovering a minimal set of branches $\mathcal{P}$ of a
program $\Pi$ that will help us prove or disprove a safety property $\phi$.



3.1 The Sample-and-Eliminate Algorithm

Algorithm Sample-and-Eliminate, described in Figure 5, is based on an abstraction
refinement loop that keeps the set of predicates minimal throughout the
process. It is modeled after the Sample-and-Separate algorithm [6], which is
used in a CEGAR framework for hardware verification. At each step it finds a
counterexample if one exists and checks whether it corresponds to a concrete
counterexample, as usual. Unlike previous approaches [3,9], however, it finds a
minimal set of predicates that eliminate all the concrete spurious traces that
were found so far (in the last line of the loop). Our approach to solving this
minimization problem is the subject of Section 3.2.



Input: Program $\Pi$, safety property $\phi$
Output: TRUE if proved that $\Pi \models \phi$, FALSE if proved $\Pi \not\models \phi$, and UNKNOWN
otherwise.
Variable: $T$ set of spurious counterexamples, $\mathcal{P}$ set of predicates
Initialize: $T := \emptyset$, $\mathcal{P} := \emptyset$
Forever do
    If $MC(\mathcal{A}(\Pi, \mathcal{P}), \phi)$ = TRUE return TRUE
    Else let $\tau$ be the abstract counterexample
    If $TC(\tau)$ = TRUE return FALSE
    If $\mathcal{P}$ is the set of all branches in $\Pi$ then return UNKNOWN
    $T := T \cup \{\tau\}$
    $\mathcal{P}$ := minimal set of branches of $\Pi$ that eliminates all elements of $T$

Fig. 5. Algorithm Sample-and-Eliminate uses a minimal set of predicates taken from
a program's branches to prove or disprove $\Pi \models \phi$, if such a proof is possible.
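The control flow of the loop in Figure 5 can be sketched in Python with the model checker, the trace checker and the minimization step passed in as callbacks. All four parameter names are hypothetical placeholders introduced for this sketch, and the toy callbacks below only exercise the control flow; they do not perform real verification.

```python
def sample_and_eliminate(model_check, trace_check, minimal_eliminating_set,
                         all_branches):
    """Refinement loop that recomputes a minimal predicate set each round.

    model_check(preds)        -> (holds, counterexample)
    trace_check(cex)          -> True if the counterexample is real
    minimal_eliminating_set(T)-> minimal set of branches eliminating all of T
    all_branches              -> set of all branch predicates of the program
    """
    spurious = []                 # T: spurious counterexamples seen so far
    predicates = frozenset()      # P: current predicate set
    while True:
        holds, counterexample = model_check(predicates)
        if holds:
            return "TRUE"
        if trace_check(counterexample):
            return "FALSE"
        if predicates == all_branches:
            return "UNKNOWN"
        spurious.append(counterexample)
        predicates = frozenset(minimal_eliminating_set(spurious))


# Toy callbacks: the property is provable once branch "b2" is tracked.
branches = frozenset({"b1", "b2"})
proved = sample_and_eliminate(
    model_check=lambda preds: (True, None) if "b2" in preds else (False, "t0"),
    trace_check=lambda cex: False,
    minimal_eliminating_set=lambda traces: {"b2"},
    all_branches=branches,
)
print(proved)
```

Note that the predicate set is replaced, not extended, in every iteration, so a predicate chosen earlier may be dropped later if a smaller eliminating set exists.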
3.2 Minimizing the Eliminating Set

The last line of Sample-and-Eliminate presents the following problem: given a
set of spurious counterexamples $T$ and a set of candidate predicates $\mathcal{P}$ (all the
branches of $\Pi$ in our case), find a minimal set $p \subseteq \mathcal{P}$ which eliminates all the
traces in $T$. We present a three step algorithm for solving this problem. First,
find a mapping $T \mapsto 2^{2^{\mathcal{P}}}$ between each trace in $T$ and the set of sets of predicates
in $\mathcal{P}$ that eliminate it. This can be achieved by iterating through every $p \subseteq \mathcal{P}$ and
$\tau \in T$, using TraceEliminate to determine if $p$ can eliminate $\tau$. This approach
is exponential in $|\mathcal{P}|$ but below we list several ways to reduce the number of
attempted combinations:

- Limit the size or number of attempted combinations to a small constant, e.g.
  5, assuming that most traces can be eliminated by a small set of predicates.

- Stop after reaching a certain size of combinations if any eliminating solutions
  have been found.

- Break up the control flow graph into blocks and only consider combinations
  of predicates within blocks (keeping combinations in other blocks fixed).

- Use data flow analysis to only consider combinations of related predicates.

- For any $\tau \in T$, if a set $p$ eliminates $\tau$, ignore all supersets of $p$ with respect
  to $\tau$ (as we are seeking a minimal solution).

Second, encode each predicate $p_i \in \mathcal{P}$ with a new Boolean variable $p_i^b$. We use
the terms 'predicate' and 'the Boolean encoding of the predicate' interchangeably.
Third, derive a Boolean formula $\sigma$, based on the predicate encoding, that
represents all the possible combinations of predicates that eliminate the elements
of $T$. We use the following notation in the description of $\sigma$. Let $\tau \in T$ be a trace:

- $k_\tau$ denotes the number of sets of predicates that eliminate $\tau$ ($1 \le k_\tau \le 2^{|\mathcal{P}|}$).

- $s(\tau, i)$ denotes the $i$-th set ($1 \le i \le k_\tau$) of predicates that eliminates $\tau$. We
  use the same notation for the conjunction of the predicates in this set.

The formula $\sigma$ is defined as follows:

$$\sigma = \bigwedge_{\tau \in T} \bigvee_{i=1}^{k_\tau} s(\tau, i) \qquad (1)$$

For any satisfying assignment to $\sigma$, the predicates whose Boolean encodings are
assigned true are sufficient for eliminating all elements of $T$.

From the various possible satisfying assignments to $\sigma$, we look for the one
with the smallest number of positive assignments. This assignment represents
the minimal number of predicates that are sufficient for eliminating $T$. Since
$\sigma$ includes disjunctions, it cannot be solved directly with a 0-1 ILP solver. We
therefore use PBS [1], a solver for Pseudo Boolean Formulas.

A pseudo-Boolean formula is of the form $\sum_{i=1}^{n} c_i \cdot b_i \bowtie k$, where $b_i$ is a Boolean
variable and $c_i$ is a rational constant for $1 \le i \le n$. $k$ is a rational constant and
$\bowtie$ represents one of the inequality or equality relations ($\{<, \le, >, \ge, =\}$). Each
such constraint can be expanded to a CNF formula (hence the name pseudo-Boolean),
but this expansion can be exponential in $n$. PBS does not perform
this expansion, but rather uses an algorithm designed in the spirit of the
Davis-Putnam-Loveland algorithm that handles these constraints directly. PBS accepts
as input standard CNF formulas augmented with pseudo-Boolean constraints.
Given an objective function in the form of a pseudo-Boolean formula, PBS finds
an optimal solution by repeatedly tightening the constraint over the value of this
function until it becomes unsatisfiable. That is, it first finds a satisfying solution
and calculates the value of the objective function according to this solution. It
then adds a constraint that the value of the objective function should be smaller
by one. This process is repeated until the formula becomes unsatisfiable. The
objective function in our case is to minimize the number of chosen predicates
(by minimizing the number of variables that are assigned true):

$$\min \sum_{i=1}^{n} p_i^b \qquad (2)$$
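The tightening loop described above can be sketched in Python. This is not PBS and does not use its input format: a brute-force enumeration stands in for the solver, and the function names and the small instance at the bottom are assumptions invented for illustration (the instance is not the one from Example 5).

```python
from itertools import product


def satisfies_sigma(assignment, eliminating_sets):
    """sigma holds if, for every trace, some eliminating set is fully chosen."""
    return all(
        any(all(assignment[p] for p in subset) for subset in subsets)
        for subsets in eliminating_sets.values()
    )


def find_solution(variables, eliminating_sets, bound):
    """Brute-force stand-in for a solver call.

    Returns an assignment satisfying sigma with at most `bound` variables
    set to true, or None if no such assignment exists.
    """
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if sum(values) <= bound and satisfies_sigma(assignment, eliminating_sets):
            return assignment
    return None


def minimize(variables, eliminating_sets):
    """Tighten the bound on the objective until the problem is unsatisfiable."""
    best = None
    bound = len(variables)
    while True:
        solution = find_solution(variables, eliminating_sets, bound)
        if solution is None:
            return best           # the last solution found is optimal
        best = solution
        bound = sum(solution.values()) - 1   # demand a strictly smaller value


# Hypothetical instance: each trace maps to the sets that eliminate it.
elim = {
    "t1": [{"p1", "p2"}, {"p3"}],
    "t2": [{"p3"}, {"p1"}],
}
result = minimize(["p1", "p2", "p3"], elim)
chosen = {p for p, value in result.items() if value}
print(chosen)
```

The enumeration is exponential in the number of variables, so the sketch is only meant to show the shape of the loop, not to replace a dedicated solver.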



Example 5. Suppose that the trace ti is eliminated by either {pi,P 3 ,P 5 } or 
{P 2 iPd} and that the trace T 2 can be eliminated by either {p 2 ,Ps} or {pi}. 
The objective function is minX^^iPi i® subject to the constraint: 

a = ((p^ Ap|Ap^)V(p^Ap|))A 
((p^Ap|)V(pi)) 

The minimal satisfying assignment in this case is p\= p\= p\ = true. □ 

Other techniques for solving this optimization problem are possible, including 
minimal hitting sets and logic minimization. The PBS step, however, has not 
been a bottleneck in any of our experiments. 



4 Experiments and Conclusions 

We implemented our technique inside the MAGIC [4] tool. MAGIC was designed
to check weak simulation of properties of labeled transition systems (LTSs)
derived from C programs. We experimented with MAGIC with and without predicate
optimization. We also performed experiments with a greedy predicate minimization
strategy implemented on top of MAGIC. In each iteration, this strategy
first adds predicates sufficient to eliminate the spurious counterexample to the
predicate set $\mathcal{P}$. Then it attempts to reduce the size of the resulting $\mathcal{P}$ by using
the algorithm described in Figure 6. The advantage of this approach is that it
requires only a small overhead (polynomial) compared to Sample-and-Eliminate,
but on the other hand it does not guarantee an optimal result. Further, we
performed experiments with Berkeley's BLAST [9] tool. BLAST also takes C
programs as input, and uses a variation of the standard CEGAR loop based
Input: Set of predicates $\mathcal{P}$
Output: Subset of $\mathcal{P}$ that eliminates all spurious counterexamples so far
Variable: $X$ of type set of predicates
LOOP: Create a random ordering $\langle p_1, \ldots, p_k \rangle$ of $\mathcal{P}$
For $i = 1$ to $k$ do
    $X := \mathcal{P} \setminus \{p_i\}$
    If $X$ can eliminate every spurious counterexample seen so far
        $\mathcal{P} := X$
        Goto LOOP
Return $\mathcal{P}$

Fig. 6. Greedy predicate minimization algorithm.
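The greedy strategy of Figure 6 translates almost directly into Python. In this sketch the check "X can eliminate every spurious counterexample seen so far" is abstracted into a callback named `eliminates_all`, which is an assumption of the sketch; in the setting of the paper it would be built from TraceEliminate.

```python
import random


def greedy_minimize(predicates, eliminates_all, rng=random):
    """Drop predicates one at a time while all spurious traces stay eliminated.

    predicates:     iterable of predicates (the current set P)
    eliminates_all: callback taking a candidate set and returning True if it
                    still eliminates every spurious counterexample seen so far
    rng:            source of randomness providing shuffle()
    """
    current = set(predicates)
    while True:
        order = list(current)
        rng.shuffle(order)                 # LOOP: fresh random ordering
        for p in order:
            candidate = current - {p}
            if eliminates_all(candidate):
                current = candidate        # P := X, then restart (Goto LOOP)
                break
        else:
            return current                 # no single predicate can be removed


# Toy check: only "p3" is actually needed.
reduced = greedy_minimize({"p1", "p2", "p3"}, lambda s: "p3" in s,
                          rng=random.Random(0))
print(reduced)
```

The result is only locally minimal: no single predicate can be removed, but a smaller set reachable by swapping predicates may still exist, which matches the remark that the greedy strategy does not guarantee an optimal result.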



on lazy abstraction, but without minimization. Lazy abstraction refines an ab- 
stract model while allowing different degrees of abstraction in different parts of 
a program, without requiring recomputation of the entire abstract model in each 
iteration. Laziness and predicate minimization are, for the most part, orthogonal 
techniques. In principle a combination of the two might produce better results 
than either in isolation. 

Benchmarks. We used two kinds of benchmarks. A small set of relatively
simple benchmarks was derived from the examples supplied with the BLAST
distribution and regression tests for MAGIC. The difficult benchmarks were
derived from the C source code of openssl-0.9.6c, several thousand lines of code
implementing the SSL protocol used for secure transfer of information over the
Internet. A critical component of this protocol is the initial handshake between 
a server and a client. We verified different properties of the main routines that 
implement the handshake. The names of benchmarks that are derived from the 
server routine and client routine begin with ssl-srvr and ssl-clnt respec- 
tively. In all our benchmarks, the properties are satisfied by the implementation. 
The server and client routines have roughly 350 lines each but, as our results 
indicate, are non-trivial to verify. 

Results. Figure 7 summarizes our results. Time for all experiments is given in 
seconds. All experiments were performed on an AMD Athlon XP 1600 machine 
with 900 MB of RAM running RedHat 7.1. The column Iter reports the number 
of iterations through the CEGAR loop necessary to complete the proof. Predi- 
cates are listed differently for the two tools. For BLAST, the first number is the 
total number of predicates discovered and used and the second number is the 
number of predicates active at any one point in the program (due to lazy ab- 
straction this may be smaller) . In order to force termination we imposed a limit 
of three hours on the running time. We denote by in the Time column exam- 
ples that could not be solved in this time limit. In these cases the other columns 
indicate relevant measurements made at the point of forceful termination. 

For MAGIC, the first number is the total number of expressions used to prove
the property, i.e. $|\bigcup_s \mathcal{P}_s|$. The number of predicates (the second number)
Fig. 7. Results for BLAST and MAGIC with different refinement strategies. indicate
run-time longer than 3 hours. ‘ x ’ indicate negligible values. Best results are emphasized.



may be smaller, as MAGIC combines multiple mutually exclusive expressions
(e.g. x == 1, x < 1, and x > 1) into a single, possibly non-binary predicate,
having a number of values equal to the number of expressions (plus one, if the
expressions do not cover all possibilities). The final number for MAGIC is the
size of the final $\mathcal{P}$. For experiments in which memory usage was large enough to
be a measure of state space size rather than overhead, we also report memory
usage (in megabytes).

The first MAGIC results are for the MAGIC tool operating in the standard 
refinement manner: in each iteration, predicates sufficient to eliminate the spu- 
rious counterexample are added to the predicate set. The second MAGIC results 
are for the greedy predicate minimization strategy. The last MAGIC results are 
for predicate minimization. Rather than solving the full optimization problem, 
we simplified the problem as described in section 3. In particular, for each trace 
we only considered the first 1,000 combinations and only generated 20 eliminat- 
ing combinations. The combinations were considered in increasing order of size. 
After all combinations of a particular size had been tried, we checked whether 
at least one eliminating combination had been found. If so, no further combina- 
tions were tried. In the smaller examples we observed no loss of optimality due 
to these restrictions. We also studied the effect of altering these restrictions on 
the larger benchmarks and we report on our findings later. 







S. Chaki et al. 





          |  ssl-srvr-4                     |  ssl-srvr-15                    |  ssl-clnt-1
ELM  SUB  |  Time  It  |P|  Mem  TG  MG     |  Time  It  |P|  Mem  TG  MG     |  Time  It  |P|  Mem  TG  MG
 50  250  |   656   8    2   64  34   1     |  1170  15    3   72  86   1     |  1089  13    2   67  66   1
100  250  |   656   8    2   64  34   1     |  1169  15    3   72  86   1     |  1089  13    2   67  66   1
150  250  |   657   8    2   64  34   1     |  1169  15    3   72  86   1     |  1090  13    2   67  66   1
200  250  |   656   8    2   64  34   1     |  1170  15    3   72  86   1     |  1089  13    2   67  66   1
250  250  |   656   8    2   64  34   1     |  1168  15    3   72  86   1     |  1090  13    2   67  66   1

Fig. 8. Results for optimality. ELM = MAXELM, SUB = MAXSUB, It is the number
of iterations, TG is the total number of eliminating subsets generated, and MG is the
maximum size of any eliminating subset generated.



For the smaller benchmarks, the various abstraction refinement strategies 
do not differ markedly. However, for our larger examples, taken from the SSL 
source code, the refinement strategy is of considerable importance. Predicate 
minimization, in general, reduced verification time (though there were a few 
exceptions to this rule, the average running time was considerably lower than 
for the other techniques, even with the cutoff on the running time). Moreover, 
predicate minimization reduced the memory needed for verification, which is an 
even more important bottleneck. Given that the memory was cut off in some
cases for other techniques before verification was complete, the results are even
more compelling.

The greedy approach kept memory use fairly low, but almost always failed
to find near-optimal predicate sets and converged much more slowly than the usual
monotonic refinement or predicate minimization approaches. Further, it is not
clear how much final memory usage would be improved by the greedy strategy 
if it were allowed to run to completion. Another major drawback of the greedy 
approach is its unpredictability. We observed that on any particular example, the 
greedy strategy might or might not complete within the time limit in different 
executions. Clearly, the order in which this strategy tries to eliminate predicates 
in each iteration is very critical to its success. Given that the strategy performs 
poorly on most of our benchmarks using a random ordering, more sophisticated 
ordering techniques may perform better. We leave this issue for future research. 



Optimality. We experimented with two of the parameters that affect the
optimality of our predicate minimization algorithm: (i) the maximum number of
examined subsets (MAXSUB) and (ii) the maximum number of eliminating subsets
generated (MAXELM); that is, the procedure stops the search once MAXELM
eliminating subsets have been found, even if fewer than MAXSUB combinations were
tried. We first kept MAXSUB fixed and took measurements for different values
of MAXELM on a subset of our benchmarks, viz. ssl-srvr-4, ssl-srvr-15 and
ssl-clnt-1. Our results, shown in Figure 8, clearly indicate that the optimality
is practically unaffected by the value of MAXELM.

Next we experimented with different values of MAXSUB (the value of MAX- 
ELM was set equal to MAXSUB). The results we obtained are summarized in 
Figure 9. It appears that, at least for our benchmarks, increasing MAXSUB 
Fig. 9. Results for optimality. SUB = MAXSUB, It is the number of iterations, TG is
the total number of eliminating subsets generated, MT is the maximum size of subsets
tried, and MG is the maximum size of eliminating subsets generated.



leads only to increased execution time without reduced memory consumption or 
number of predicates. The additional number of combinations attempted or con- 
straints allowed does not lead to improved optimality. The most probable reason 
is that, as shown by our results, even though we are trying more combinations, 
the actual number or maximum size of eliminating combinations generated does 
not increase significantly. It would be interesting to investigate whether this is 
a feature of most real-life programs. If so, it would allow us, in most cases, to 
achieve near optimality by trying out only a small number of combinations or 
only combinations of small size. 
Abstract. This paper presents a method for taking advantage of the
efficiency of symbolic model checking using disjunctive partitions, while 
keeping the number and the size of the partitions small. We define a 
restricted form of a Kripke structure, called an or-structure, for which 
it is possible to generate small disjunctive partitions. By changing the 
image and pre-image procedures, we keep even smaller partial disjunctive 
partitions in memory. In addition, we show how to translate a (software) 
program to an or-structure, in order to enable efficient symbolic model 
checking of the program using its disjunctive partitions. We build one 
disjunctive partition for each state variable in the model directly from 
the conjunctive partition of the same variable and independently of all 
other partitions. This method can be integrated easily into existing model 
checkers, without changing their input language, and while still taking 
advantage of reduction algorithms which prefer conjunctive partitions. 



1 Introduction 

Symbolic model checking suffers from the known problem of state explosion. This 
explosion usually happens while performing the image or pre-image computation. 
In order to cope with this problem, symbolic model checkers use partitioned 
transition relations [8]. Using ordered conjunctive partitioning [7] is quite simple 
and sometimes allows early quantification while computing the image or pre- 
image; this serves to decrease the needed memory. 

The RuleBase model checker [1] uses ordered conjunctive partitioning, and 
previous work showed its application to general purpose software [5,6]. In this 
paper, we show how disjunctive partitioning can be used to increase the efficiency 
of symbolic model checking for software. 

Disjunctive partitioning, first introduced in [8], has several advantages over 
conjunctive partitioning. First, both image and pre-image computations are more 
efficient using disjunctive partitions, since quantification distributes over dis- 
junction but not over conjunction [9,8]. For the same reason, distributed model 
checking using disjunctive partitions is also more scalable than using conjunctive 
partitioning, since each process can do the quantification on its own. As a result, 
the “heavy” computation is divided by the number of processes. 

Despite the advantages of disjunctive partitioning, use of the technique is 
generally hindered by the difficulty in building the partitions. The method pre- 
sented in [8] is efficient only for asynchronous circuits. It builds the disjunctive 
partitions using an interleaving model, which allows only one wire to change its
value at a time. 

Both [2] and [4] suggested how to build disjunctive partitions for synchronous 
circuits. In [2], we see how to decompose an FSM into smaller FSMs, and then 
use this decomposition to split the conjunctive partitioned transition relation 
into a disjunction of conjunctive partitioned transition relations. In [4], a set 
of mutually exclusive events is used to decompose the behavior of the circuit 
to disjunctive partitions. Large disjunctive partitions are split into conjunctive 
partitions, which results in a DNF partitioning as in [2]. Both methods need 
additional information on the circuit in order to get a good decomposition. 

Disjunctive partitioning is also used in [10], where each transition is a separate 
disjunctive partition. The contribution of [10] is in presenting the order in which 
the transitions should be executed in order to achieve improved performance. 

While all the above works are applicable to models generated for software, 
applying them to software is problematic. The method of [8] is applicable to 
parallel software, but does not decompose each process to disjunctive partitions. 
On the other hand, [10] creates a large number of disjunctive partitions. The 
methods of [2] and [4] are not automated and require additional information from 
the user. We introduce a new method applicable to software models in which the 
decomposition is generated automatically, without additional information from 
the user. The number of disjunctive partitions created is similar to that of the 
conjunctive partitions for the same model, and the BDD size of the disjunctive 
partitions is comparable to that of the conjunctive partitions. 

Software has the feature that in each step there is little change in the program 
variables. It is quite easy to build a model for software where each step changes 
only the pc (program counter), and at most, one additional state variable. We 
present a modeling language called ODL, which is natural for defining such 
models. We also present a method for translating from conjunctive partitions to 
disjunctive partitions and vice versa. These translations can be easily adapted 
by any symbolic model checker that uses conjunctive partitioning and by doing 
so, may benefit from the advantages of disjunctive partitioning. 

In the traditional image computation algorithm, each disjunctive partition 
must represent the next value for all variables, so the disjunctive partition of 
state variable x should indicate the change of x and pc, and the fact that all 
other variables keep their value. The latter information might severely impact 
the BDD size of the partition. In this work, we change the image and pre-image 
computation in such a way that they can work on the partial disjunctive partition 
of x, which represents only the changes of x and pc, and not the fact that all other
variables keep their value. Using this algorithm decreases the BDD size needed 
to represent the disjunctive partitions and improves the image computation. 
This method is applicable not only for software models, but also to some other 
methods ([8], [2] and [4]) based on the fact that only a subset of the variables
in the model can change their value in each disjunctive partition. 

Finally we suggest two schemes for distributed model checking that use the 
disjunctive partitioning. 
In our work we implemented the translation from conjunctive partitioned
transition relation to disjunctive partitioned transition relation. We show that 
the size of the partial disjunctive partitions is equal to, or even smaller than, the 
size of the conjunctive partitions. In addition, we show that calculating reacha- 
bility analysis using disjunctive partitions significantly outperforms calculation 
using conjunctive partitions. 

The remainder of this paper is structured as follows: Section 2 states the 
preliminaries. Section 3 presents the generation of the model from the software 
and the ODL modeling language. Section 4 presents the translation between 
conjunctive and disjunctive partitions, and vice versa. Section 5 introduces par- 
tial disjunctive partitions and their advantages, and Section 6 presents the dis- 
tributed version. In Section 7 we present some experimental results. We conclude 
and suggest some directions for future work in Section 8. 



2 Preliminaries 

A finite program can be modeled by a Kripke structure $M$ over a set of atomic
propositions $AP$. $M = (S, S_0, R, L)$, where $S$ is a finite set of states, $S_0$ is a
set of initial states, $R \subseteq S \times S$ is a total transition relation, and $L : S \to 2^{AP}$
is a labeling function that labels each state with the set of atomic propositions
that are true in that state. The states of the Kripke structure are coded by a
set of state variables $v$. Each valuation to $v$ is a state in the structure. Model
checking is a technique for verifying finite state systems represented as Kripke
structures. The basic operations in model checking are the image computation
and the pre-image computation. Given a set of states $S$ and a transition relation
$R$, represented in symbolic model checking by the BDDs $S(v)$ and $R(v, v')$
respectively, the image computation finds the set of all successors under $R$ of
states in $S$, and the pre-image computation finds the set of all predecessors under
$R$ of states in $S$. More precisely, $image(S(v), R(v, v')) = \exists v (S(v) \wedge R(v, v'))$
and $pre\_image(S(v'), R(v, v')) = \exists v' (S(v') \wedge R(v, v'))$. The
result of $image(S(v), R(v, v'))$ is over $v'$. In order to get the result over $v$, all
BDD variables are "unprimed".
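As an explicit-state analogue of these two operations, the following Python sketch computes image and pre-image over a transition relation stored as a set of pairs. It is only meant to illustrate the definitions: a symbolic model checker performs the same computation on BDDs, and the small relation used here is invented for illustration.

```python
def image(states, relation):
    """Successors: all t such that (s, t) is in the relation for some s in states."""
    return {t for (s, t) in relation if s in states}


def pre_image(states, relation):
    """Predecessors: all s such that (s, t) is in the relation for some t in states."""
    return {s for (s, t) in relation if t in states}


# Hypothetical total transition relation over the states 0, 1, 2.
R = {(0, 1), (1, 2), (2, 2)}
successors = image({0, 1}, R)
predecessors = pre_image({2}, R)
print(successors, predecessors)
```

The set comprehension plays the role of the existential quantifier: the source state is quantified away for the image, the target state for the pre-image.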

A conjunctive partitioned transition relation is composed of a set of partitions
$\mathit{and\_R}_i$ such that $R(v, v') = \bigwedge_i \mathit{and\_R}_i(v, v')$. In case each state variable can
be described by a single conjunctive partition (as in this work), we have that
$\mathit{and\_R}_{v_i} = (v_i' = f_{v_i}(v))$ and thus each partition is a function of $v$ and $v_i'$
rather than $v$ and $v'$. The image computation in this case is
$image(S(v)) = \exists v (S(v) \wedge (\bigwedge_{v_i} \mathit{and\_R}_{v_i}(v, v_i')))$.

Computing $\exists x A(v)$ is referred to as quantifying $x$ out of $A$. Early
quantification [8] can make image and pre-image computations even more efficient. Early
quantification is done by quantifying a variable $x$ out of the intermediate BDD
result, after conjuncting the last conjunctive partition that is dependent on $x$.
Quantifying a variable out of the intermediate BDD may reduce the size of the
BDD and as a result make the image computation easier.
A disjunctive partitioned transition relation is composed of a set of
disjunctive partitions $\mathit{or\_R}_i$ such that $R(v, v') = \bigvee_i \mathit{or\_R}_i(v, v')$. In the case
where each state variable can be changed only in a single disjunctive partition,
we have that $\mathit{or\_R}_{v_i} = (v_i' = f_{v_i}(v)) \wedge (\forall y \ne v_i : y = y')$. The
image computation when using disjunctive partitions is done by calculating
$image(S(v)) = \exists v (S(v) \wedge (\bigvee_{v_i} \mathit{or\_R}_{v_i}(v, v')))$. Because existential
quantification distributes over disjunction, we have that every quantification is "early",
and thus $image(S(v)) = \bigvee_{v_i} \exists v (S(v) \wedge \mathit{or\_R}_{v_i}(v, v'))$. Because the
quantification is done "early" for every $v$ in the disjunctive partitioning, all intermediate
BDD results depend only on $v'$, while when using conjunctive partitions the
intermediate BDD results may depend both on $v$ and $v'$. Thus, using disjunctive
partitions usually results in smaller intermediate BDDs than when using
conjunctive partitions.
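The distribution argument can be checked on explicit sets. In the Python sketch below, each partition is a set of transition pairs, disjunction of partitions is set union and conjunction is set intersection; the helper names and the tiny relations are assumptions chosen for illustration, and real implementations operate on BDDs.

```python
def image(states, relation):
    """Successors of `states` under a relation given as a set of pairs."""
    return {t for (s, t) in relation if s in states}


def image_disjunctive(states, partitions):
    """Image over a disjunctively partitioned relation, one partition at a time."""
    result = set()
    for part in partitions:
        result |= image(states, part)   # each partition is quantified on its own
    return result


# Disjunction: the per-partition images combine into the full image.
parts = [{(0, 1)}, {(0, 2), (1, 2)}]
whole = set().union(*parts)
per_partition = image_disjunctive({0}, parts)
monolithic = image({0}, whole)

# Conjunction: intersecting per-partition images can overshoot the true image.
r1 = {(0, 1), (1, 2)}
r2 = {(0, 2), (1, 1)}
true_image = image({0, 1}, r1 & r2)
naive_image = image({0, 1}, r1) & image({0, 1}, r2)
print(per_partition == monolithic, true_image, naive_image)
```

The second half shows why the same shortcut is unavailable for conjunctive partitions: the relations `r1` and `r2` share no transition, so the true image is empty, while intersecting the separately computed images is not.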

Note that as opposed to a conjunctive partition, the naive disjunctive partition
is dependent on the entire vector $v'$, rather than just a single $v_i'$. We return
to this point later and show how to avoid it by modifying the image computation.

Let $A \subseteq S$ be a set of states and let $x$ be a set of variables. We use the
notation $A|_x$ to indicate the projection of the set $A$ onto $x$. That is:

$A|_x = \{s \in S \mid \exists a \in A \text{ such that } s \text{ and } a \text{ agree on all values of the variables in } x\}$.

3 Generating a Model from Software 

Previous work showed the application of symbolic model checking to general 
purpose software [5,6] by translating C source code to EDL (Environment De- 
scription Language), a dialect of SMV [9], which is the input language to the 
RuleBase model checker. EDL, like SMV, is naturally suited for building of con- 
junctive partitions. That previous work was based on a specially-built parser and 
was limited to a small subset of C. In this work, we build a similar model using 
a full-blown compiler front-end. The most important thing about this model is 
that it has the following structure. 

Definition 1. An or-structure is a Kripke structure in which for every two
states $s, s'$: if $R(s,s')$ then $s$ and $s'$ differ from each other only in the
values of the pc and no more than a single additional state variable $x$.

The model we build has a state variable for each global variable in the C 
code and a state variable named pc (program counter) that holds the value of 
the next statement to be performed. The model also has stacks to support local 
variables, functions and recursion, and some special variables to support arrays 
and pointers (without pointer arithmetic). The basics of the generation process 
are explained here using a simple example. Afterward, we will discuss the special 
treatment for pointers and arrays. 

The translation process first translates the C code to intermediate code.
There are two reasons for using intermediate code: 1. It will ease the support of
other input languages in the future. 2. It generates the pc in a way such that
for each value of pc, a maximum of one memory location changes its value. One
may object to using intermediate code because it increases the number of values
pc can get, and therefore increases the number of states in the model. While this
is true, the number of pc values is only multiplied by a small factor and therefore
adds to the state variables only 2 or 3 bits, which are negligible.

(a) C code of div.c [listing not recoverable]

(b) The intermediate code of div.c

    pc   instruction
    18   r1 ← 0
    19   z ← r1
    22
    24   r2 ← x
    26   pc ← (r2 > 0) ? 29 : 27
    27   pc ← 53
    29
    35   r3 ← z
    37   r4 ← 5
    39   r5 ← r3 + r4
    41   z ← r5
    44   r6 ← x
    46   r7 ← r6 − 1
    48   x ← r7
    50   pc ← 22
    53

Fig. 1. Example of translation from C to intermediate code.

In Figure 1.a we can see a fragment of a C program. The code has two global
variables called x and z. This code is translated to an assembly-like intermediate
code shown in Figure 1.b. In the intermediate code, there is a list of instructions,
each with a unique pc (program counter), listed at the beginning of each line.
The pc is updated to the pc of the next line if not specified otherwise. The first
two lines indicate the behavior for pc = 18 and pc = 19. This is the intermediate
code generated for line 1 in the C code (z = 0). At pc = 18 the value 0 is inserted
to r1, and r1 is inserted to z in pc = 19. Lines like the one for pc = 22, which
don't have any code, are used as jump targets and only update the pc to the pc
of the next line. Lines for pc = 24 through 27 perform the while condition: first
in pc = 24 x is inserted into r2 and then in pc = 26 it is checked if it is bigger
than 0. A true answer sets the pc to 29 (enter the loop), while a false answer
sets it to the pc of the next line, which in turn sets the pc to 53, after the loop.
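
To make the execution rule concrete, here is a minimal interpreter sketch for the
intermediate code of Figure 1.b. It is only an illustration under stated
assumptions: the dictionary encoding, the names `PROGRAM`, `step` and `run`, and
the choice to store registers as ordinary memory cells are ours (the model
described later represents registers as macros rather than state). The assertion
inside `run` checks that each pc value changes at most one memory location.

```python
# Intermediate code of div.c, transcribed from Figure 1.b.
# pc -> (target, expression); target "pc" is a jump, None is an empty line.
PROGRAM = {
    18: ("r1", lambda m: 0),
    19: ("z", lambda m: m["r1"]),
    22: None,
    24: ("r2", lambda m: m["x"]),
    26: ("pc", lambda m: 29 if m["r2"] > 0 else 27),
    27: ("pc", lambda m: 53),
    29: None,
    35: ("r3", lambda m: m["z"]),
    37: ("r4", lambda m: 5),
    39: ("r5", lambda m: m["r3"] + m["r4"]),
    41: ("z", lambda m: m["r5"]),
    44: ("r6", lambda m: m["x"]),
    46: ("r7", lambda m: m["r6"] - 1),
    48: ("x", lambda m: m["r7"]),
    50: ("pc", lambda m: 22),
    53: None,
}
ORDER = sorted(PROGRAM)

def step(mem, pc):
    """Execute one instruction; return (new_mem, new_pc, changed_locations)."""
    instr = PROGRAM[pc]
    new_mem = dict(mem)
    idx = ORDER.index(pc)
    # Default: fall through to the pc of the next line.
    next_pc = ORDER[idx + 1] if idx + 1 < len(ORDER) else pc
    if instr is not None:
        target, expr = instr
        if target == "pc":
            next_pc = expr(mem)
        else:
            new_mem[target] = expr(mem)
    changed = [k for k in new_mem if new_mem[k] != mem[k]]
    return new_mem, next_pc, changed

def run(x):
    """Run from pc = 18 until pc = 53; return the final (x, z)."""
    mem = {"x": x, "z": 0, **{f"r{i}": 0 for i in range(1, 8)}}
    pc = 18
    while pc != 53:
        mem, pc, changed = step(mem, pc)
        assert len(changed) <= 1   # at most one memory location per pc value
    return mem["x"], mem["z"]

print(run(3))   # (0, 15)
```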

Next we translate the intermediate code into a model. There are two possible
translations: The first one is to translate the intermediate code to a language
that has the style of a guarded transition system. Each transition is of the form:
$pc = PC_1 : (X \leftarrow f(X, Y, Z) \wedge pc \leftarrow PC_2)$. The guard is
always a condition about the value of the pc (each value of the pc has exactly one
transition) and the transition changes the value of the pc and perhaps the value
of one additional state variable¹. We refer to this language as ODL. The
translation to ODL is presented in Figure 2.a. The other possibility is to
translate the intermediate code to EDL (Figure 2.b). For both possibilities we
model the registers using a #define. In this way, the registers won't use any bits
in the model. This is possible because the intermediate code defines and uses each
register only once.

(a) Model in ODL representation

    #define main_r1 0
    #define main_r2 x
    #define main_r3 z
    #define main_r4 5
    #define main_r5 main_r3 + main_r4
    #define main_r6 x
    #define main_r7 main_r6 − 1

    pc = 19 : (z ← main_r1 ∧ pc ← 22)
    pc = 22 : (pc ← 26)
    pc = 26 : (pc ← if (main_r2 > 0) then 29 else 27)
    pc = 27 : (pc ← 53)
    pc = 29 : (pc ← 41)
    pc = 41 : (z ← main_r5 ∧ pc ← 48)
    pc = 48 : (x ← main_r7 ∧ pc ← 50)
    pc = 50 : (pc ← 22)

(b) Model in EDL representation

    #define main_r1 0
    #define main_r2 x
    #define main_r3 z
    #define main_r4 5
    #define main_r5 main_r3 + main_r4
    #define main_r6 x
    #define main_r7 main_r6 − 1

    next(pc) ← case
        pc = 19 : 22
        pc = 22 : 26
        pc = 26 : if (main_r2 > 0) then 29 else 27
        pc = 27 : 53
        pc = 29 : 41
        pc = 41 : 48
        pc = 48 : 50
        pc = 50 : 22
        else : pc
    esac;

    next(x) ← case
        pc = 48 : main_r7
        else : x
    esac;

    next(z) ← case
        pc = 19 : main_r1
        pc = 41 : main_r5
        else : z
    esac;

Fig. 2. Example of div.c translation to EDL and ODL.

The translation to ODL is very simple. Each line in the intermediate code is
translated to a guarded expression representing the changes for this value of the
pc. For example in pc = 19, z gets main_r1 (the #define that represents register
r1), and pc is set to 22. In the EDL code, we need to gather all the assignments to
a state variable in the same place. For instance, the code for next(z) includes
assignments for the lines for pcs 19 and 41 of Figure 1.b. Another difference is
that in ODL it is implicit that every state variable that is not mentioned keeps
its value, while the EDL explicitly codes it.
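
The gathering step can be sketched as follows. This is a hedged illustration, not
the tool's translator: the dictionary encoding of the ODL model of Figure 2.a and
the helpers `odl_to_edl` and `render` are assumptions introduced for the example,
and the rendered text only imitates the layout of Figure 2.b.

```python
# Hypothetical in-memory form of the ODL model of Figure 2.a:
# pc value -> {assigned variable: right-hand side as text}.
ODL = {
    19: {"z": "main_r1", "pc": "22"},
    22: {"pc": "26"},
    26: {"pc": "if (main_r2 > 0) then 29 else 27"},
    27: {"pc": "53"},
    29: {"pc": "41"},
    41: {"z": "main_r5", "pc": "48"},
    48: {"x": "main_r7", "pc": "50"},
    50: {"pc": "22"},
}

def odl_to_edl(odl):
    """Gather, per state variable, all guarded assignments into one case list."""
    cases = {}
    for pc in sorted(odl):
        for var, rhs in odl[pc].items():
            cases.setdefault(var, []).append((pc, rhs))
    return cases

def render(var, entries):
    """Print one next(var) block; the else branch makes 'keeps its value' explicit."""
    lines = [f"next({var}) <- case"]
    lines += [f"    pc = {pc} : {rhs}" for pc, rhs in entries]
    lines += [f"    else : {var}", "esac;"]
    return "\n".join(lines)

edl = odl_to_edl(ODL)
print(render("z", edl["z"]))
```

The `else` branch is where the implicit ODL convention (unmentioned variables keep
their value) becomes explicit in the EDL form.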

At first glance, it seems preferable to translate to ODL because it's simpler
to translate C code to ODL, and it is simpler to translate ODL to disjunctive
partitions. But translating the C code to EDL allows us to use RuleBase to
read EDL, build the conjunctive partitions, and perform pre-model-checking
reductions. A reduction is simply a conservative abstraction, that is, one that
preserves both positive and negative truth values. Conjunctive partitions are
more natural for performing simple reductions such as constant propagation
as well as other more sophisticated reductions performed by RuleBase. Thus,
even if we did not have conjunctive partitions, we would want to build them
and translate the result of the reduction back to disjunctive partitions. We
therefore present methods for translating from conjunctive to disjunctive
partitioning and vice versa in order to enable flexibility in our tool. In
practice, using the reductions and translating the reduced conjunctive
partitioning to disjunctive partitioning indeed proved to be useful. In addition,
analyzing the translations enables us to bound the size of the disjunctive
partitions with respect to the conjunctive partitions.

¹ Note that this transition may change a different variable depending on the value of
other state variables. However, only one state variable will change its value at any
one time. For instance, an assignment of the form a[i] = 5 will change a[0] or a[1],
etc., depending on the value of i. But only one array location will change at any one
time.

3.1 Dealing with Pointers and Arrays 

Modeling pointers and arrays creates a problem, because in general an assignment
to a variable x from an array or a pointer causes the variable x to be
dependent on more memory locations than an assignment from a scalar. In a
naive approach, the BDD size of the partition for x will be quite large, because
of the dependence on multiple variables. Furthermore, the large number of
variables in a single partition results in many constraints on the BDD order for
the entire model, which might result in a larger BDD size not just for the
partition in question, but for the entire design.

We solve this problem by using cut-points [3]. Our translation adds four
variables for each array. For array ar we add: l_index_ar, l_array_ar, r_index_ar
and r_array_ar (the prefix l/r means that the array is in the left/right side of
the assignment). We translate an assignment x = ar[i] to the three assignments
described in Figure 3(a), and an assignment ar[i] = x to the three assignments
described in Figure 3(b).

When using this translation on code containing assignments x = ar[i]; x = ar[j];
y = ar[i]; y = ar[j], we get that r_index_ar is dependent on i and j, r_array_ar
is dependent on r_index_ar and all ar cells, and x and y are dependent only on
r_array_ar. Without cut-points, we would have had that both x and y are dependent
on i, j and all cells of array ar.

In pointers, the problem is even more severe because there are generally more 
memory locations that can be affected by a pointer dereference than cells in an 
array. Still, the same idea is useful for pointers. 

Note that using cut-points and #defines for modeling registers causes a problem
when translating statements like x = a[i] + a[j]. We avoid this problem by
splitting such statements into two: temp = a[i]; x = temp + a[j].

Our translation has another attribute. An assignment such as a[a[i]] = 5 is
translated in the intermediate code into two different accesses to the array, one
to get a[i] and the second to assign to a[a[i]], so that our translation creates the
code in Figure 3(c).

(a) Translating x = ar[i]

    r_index_ar = i
    r_array_ar = ar[r_index_ar]
    x = r_array_ar

(b) Translating ar[i] = x

    l_index_ar = i
    l_array_ar = x
    ar[l_index_ar] = l_array_ar

(c) Translating a[a[i]] = 5

    r_index_a = i
    r_array_a = a[r_index_a]
    l_index_a = r_array_a
    l_array_a = 5
    a[l_index_a] = l_array_a

Fig. 3. Translation of array expressions
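
The effect of cut-points on variable dependencies can be illustrated with a short
sketch. It is an assumption-laden toy: assignments are pairs of strings, the
helpers `cut_point_read` and `dependencies` are hypothetical, and an array read is
simply taken to depend on its index variable and on every cell of the array.

```python
def cut_point_read(target, array, index):
    """The three assignments of Figure 3(a) for `target = array[index]`."""
    r_index, r_array = f"r_index_{array}", f"r_array_{array}"
    return [(r_index, index), (r_array, f"{array}[{r_index}]"), (target, r_array)]

def dependencies(assignments, array, size):
    """Map each assigned variable to the set of variables its right-hand sides read."""
    deps = {}
    for lhs, rhs in assignments:
        if rhs.startswith(f"{array}["):
            index_var = rhs[len(array) + 1:-1]
            reads = {index_var} | {f"{array}[{k}]" for k in range(size)}
        else:
            reads = {rhs}
        deps.setdefault(lhs, set()).update(reads)
    return deps

reads = [("x", "i"), ("x", "j"), ("y", "i"), ("y", "j")]

with_cut = []
for target, index in reads:
    with_cut += cut_point_read(target, "ar", index)
naive = [(target, f"ar[{index}]") for target, index in reads]

deps_cut = dependencies(with_cut, "ar", size=4)
deps_naive = dependencies(naive, "ar", size=4)
print(sorted(deps_cut["x"]), len(deps_naive["x"]))   # ['r_array_ar'] 6
```

With the cut-point variables, only `r_array_ar` carries the dependence on all
array cells; `x` and `y` each depend on a single variable.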



3.2 Splitting of Self-Assignment Statements

Assignment statements in the code can be of two kinds:

1. Self-assignment statement - Assignment to a variable x in which the assigned
value is a function of x (e.g., x += y or x = x + w + z). Such an assignment
can be further divided into two kinds: constant self-assignment statement,
where we update the variable with a constant (e.g., x *= 4, x++), and
variable self-assignment statement (e.g., x += y, x = x * b + c).

2. Foreign-assignment statement - Assignment to a variable x in which the
assigned value is not dependent on the value of x (e.g., x = y or x = w + z).

In order to reduce BDD size and achieve better performance we split variable
self-assignment statements like x += y into two: temp = x, x = temp + y. This
split increases the number of pc values and adds one variable (for all splits) but
improves the overall performance. The reason will be explained in Section 4.1.
Constant self-assignment statements can remain as is.
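
The classification and the split can be sketched in a few lines. The sketch is
illustrative only: `classify` and `split` are hypothetical helpers, the
right-hand side is given as whitespace-separated tokens together with the list of
variables it reads, and the temporary is always named `temp`.

```python
def classify(target, rhs_vars):
    """Classify `target = f(rhs_vars)` by the variables read on the right-hand side."""
    if target not in rhs_vars:
        return "foreign"
    return "constant-self" if set(rhs_vars) == {target} else "variable-self"

def split(target, rhs_text, rhs_vars, temp="temp"):
    """Split a variable self-assignment into `temp = target; target = rhs[temp/target]`."""
    if classify(target, rhs_vars) != "variable-self":
        return [(target, rhs_text)]          # other kinds remain as they are
    tokens = [temp if tok == target else tok for tok in rhs_text.split()]
    return [(temp, target), (target, " ".join(tokens))]

print(split("x", "x + y", ["x", "y"]))   # [('temp', 'x'), ('x', 'temp + y')]
```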

4 Translating between Disjunctive and Conjunctive 
Partitions 

In this section, we show how to build the disjunctive partition of a state variable
$x$, $or\_R_x(v,v')$, from its conjunctive partition $and\_R_x(v,x')$ and vice
versa. Our construction is applicable only to or-structures where each dereference,
such as arrays and pointers, is broken by a cut-point. Let pc be the state variable
that codes the program counter of the program and $y$ be the state variables which
are different from pc and $x$.

Definition 2. $dep\_states_x(v)$ is a set of states such that for every
$s \in dep\_states_x(v)$ there exists $s'$ such that $R(s, s')$ and $x$ has
different values in $s$ and $s'$.

Intuitively, $dep\_states_x(v)$ are all the states related to lines in the C program
where $x$ is assigned a value, except for the case where $x$ is assigned the same
value it had before the assignment.

Definition 3. $dep\_pcs_x(pc)$ is the set of pc values which are related to
statements in which $x$ may change².

Definition 4. The partial disjunctive partition of a state variable $x$, denoted
by $por\_R_x(pc,x,y,x',pc')$, is the disjunctive partition $or\_R_x(v,v')$ without
the requirement that the variables in $y$ are left unchanged:

$or\_R_x(v,v') = por\_R_x(pc,x,y,x',pc') \wedge (y = y')$

4.1 Building Disjunctive Partitions from Conjunctive Partitions 

We now show how to build each disjunctive partition from the conjunctive
partition of the same state variable and the conjunctive partition of pc.

Translation for $x \neq pc$: First we show how to build $or\_R_x(v,v')$ for
$x \neq pc$.

1. Calculate $dep\_states_x(v)$:

   $dep\_states_x(v) = \exists x'\,(and\_R_x(v,x') \wedge (x \neq x'))$.

2. Intersect the quantification of $x$ from $dep\_states_x(v)$ with the conjunctive
partitions of $x$ and pc:

   $por\_R_x(pc,x,y,x',pc') = (\exists x\,(dep\_states_x(v))) \wedge and\_R_x(v,x') \wedge and\_R_{pc}(v,pc')$

3. Intersect $por\_R_x(pc,x,y,x',pc')$ with $y = y'$ to indicate that the other
variables do not change:

   $or\_R_x(v,v') = por\_R_x(pc,x,y,x',pc') \wedge (y = y')$
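
The three steps can be traced on an explicitly enumerated model. The following is
a sketch under stated assumptions, not the BDD-based implementation: the
three-line program, the next-state functions `and_R_x` and `and_R_pc`, and the
set encodings are all hypothetical.

```python
from itertools import product

VALUES = (0, 1)
PCS = (0, 1, 2)
STATES = list(product(PCS, VALUES, VALUES))      # state = (pc, x, y)

# Hypothetical three-line program: pc=0: x = 1;  pc=1: y = x;  pc=2: goto 0
def and_R_x(s):
    """Conjunctive partition of x, given as a next-state function."""
    pc, x, y = s
    return 1 if pc == 0 else x

def and_R_pc(s):
    """Conjunctive partition of pc, given as a next-state function."""
    pc, x, y = s
    return (pc + 1) % 3

# Step 1: states from which x really changes (exists x'. and_R_x and x != x').
dep_states_x = {s for s in STATES if and_R_x(s) != s[1]}

# Step 2: quantify x out, then conjoin with the partitions of x and pc.
dep_no_x = {(pc, y) for (pc, x, y) in dep_states_x}
por_R_x = {(s, and_R_x(s), and_R_pc(s)) for s in STATES if (s[0], s[2]) in dep_no_x}

# Step 3: add y' = y to obtain the ordinary disjunctive partition.
or_R_x = {(s, (pc1, x1, s[2])) for (s, x1, pc1) in por_R_x}

print(len(dep_states_x), len(por_R_x))   # 2 4
```

Because $x$ is quantified out in step 2, the transitions from states where x
already holds 1 also land in the partition of x, although they are not in
$dep\_states_x(v)$.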

We use $dep\_states_x(v)$ in our construction and not $dep\_pcs_x(pc)$ because two
states in which the pc value is identical do not necessarily change the same state
variable. For example, consider the C statement a[i] = 5 and assume that it is
related to pc = 7. For each value of i this statement changes a different state
variable. Thus, the value pc = 7, which is related to this statement, will be
in more than one disjunctive partition. If we had used $dep\_pcs_x(pc)$ the state
{pc = 7; i = 2} would have been both in the partition of a[2] and a[1]. As a
result, after conjuncting the disjunctive partition of a[1] with $y = y'$ it would
have contained another transition, that does not exist in the original model and
changes only pc and not a[1] or a[2]. This transition would have been entered
to the disjunctive partition of a[1] because a[2] is in $y$. The quantification that
appears in $por\_R_x(pc,x,y,x',pc')$ is discussed in detail later.

² $x$ may not always change its value in a certain pc. For example, when $x$ is a
cell in an array, a[0], and the assignment is a[i] = 5, a[0] is assigned a value
only if i = 0 and stays unchanged otherwise.

Translation for pc: Calculating $por\_R_{pc}(pc,x,y,pc')$ is a bit different.

1. Calculate $dep\_pcs_x(pc)$ for each $x \neq pc$:

   $dep\_pcs_x(pc) = dep\_states_x(v)|_{pc}$

2. Calculate the set of pc values $jump\_pcs(pc)$ that are related to statements
in which pc is the only state variable that is changed. These pc values are
related to statements in which there is a control branch like an if statement.

   $jump\_pcs(pc) = \bigwedge_{x \neq pc} \neg(dep\_pcs_x(pc))$

3. Intersect $and\_R_{pc}(v,pc')$ with $jump\_pcs(pc)$ to get the value of $pc'$
for this pc value.

   $por\_R_{pc}(pc,x,y,pc') = jump\_pcs(pc) \wedge and\_R_{pc}(v,pc')$

4. Intersect $por\_R_{pc}(pc,x,y,pc')$ with $y = y'$, where $y$ is all variables
that are different from pc.

   $or\_R_{pc}(v,v') = por\_R_{pc}(pc,x,y,pc') \wedge (y = y')$

Discussion: The general idea is that transitions in which only the pc changes
should be in the partition of the pc, and transitions in which both the pc and
some variable x change should be in the partition of x. Naively, this means that
a line with some assignment would appear in the partition of the variable being
assigned, while a line without an assignment would appear in the partition of
the pc. However, things are not so simple. Consider the assignment x = 5. If x
has the value 5 before the assignment, then a transition from this line changes
only the pc. If x has another value before the assignment, then this line changes
both x and the pc. A naive construction of the or-partitions from the and-
partitions would put the transition from a state where x has the value 5 into the
partition of the pc, rather than into the partition of x. We would like to put this
transition into the partition of x, because in this way the BDDs will be in some
sense "cleaner" - that is, we hope that the BDD size will be smaller. Two other
related problems are the case of assignments of the form x += y, where y has
the value 0, and the case of assignments to a[i] for some array a, where i is out
of the array bounds. Our method deals with such cases as explained below.

In order to deal with assignments of the form x = 5 we quantify $x$
out of $dep\_states_x(v)$ before conjuncting the result of the quantification with
$and\_R_x(v,x')$ and $and\_R_{pc}(v,pc')$. By doing so we ignore the value of $x$
before the assignment.

The need to deal with assignments of the form x += y (for which y = 0 may
cause a problem in a naive construction) is the source of the splitting of variable
self-assignment statements into two, as described in Section 3.2 above. This way,
we avoid dealing with such assignments in the construction itself.

The problem with assignments such as a[i] = 5 needs some explanation.
Consider an array a[0..2] of size three and a statement a[i] = 5, where i equals
7. Because a[7] is not a real variable in the program, there is no corresponding
state variable in our model (otherwise the model would have been unbounded).
Thus, in such a case, in our model only the pc is changed, and the conjunctive
partitioned transition relation contains a transition which changes only the pc.
But this statement is related to transitions that do change variable values (for
i < 3), and thus does not "belong" in the partition for the pc (according to our
notion of "cleanness"). It is possible to overcome this problem by adding a new
overflow variable to the model, the disjunctive partition of which will capture
this behavior.

Finally, we note that in the general case, our translation does not work for
statements such as a[i] = a[5] or a[a[i]] = a[i] + 1. However, when the model is
generated as we suggested in Section 3.1, such statements are always split up
into several statements and therefore the problem is avoided.



4.2 Building Conjunctive Partitions from Partial Disjunctive 
Partitions 

We previously discussed how to build a disjunctive partition from a conjunctive
partition. In this subsection, we present the translation in the opposite direction.

1. We first calculate $dep\_pcs_x(pc)$ simply by looking at the pcs that appear in
$por\_R_x(v,x',pc')$:

   $dep\_pcs_x(pc) = por\_R_x(v,x',pc')|_{pc}$

2. Now we can calculate $and\_R_x(v,x')$. It is formed from a union of two sets:
the states in which $x$ changes its value and the states in which $x$ saves its
value.

   $and\_R_x(v,x') = (\exists pc'\,(por\_R_x(v,x',pc'))) \vee (\neg dep\_pcs_x(pc) \wedge x = x')$

3. Now we can calculate $and\_R_{pc}(v,pc')$. It is calculated by gathering the
transition pc to pc' in all the partial disjunctive partitions of the variables and
taking the union of the result with $por\_R_{pc}(v,pc')$.

   $and\_R_{pc}(v,pc') = por\_R_{pc}(v,pc') \vee (\bigvee_{x \neq pc} por\_R_x(v,pc',x')|_{(pc,pc')})$



5 Using Partial Disjunctive Partitions 

In the previous section, we showed how to calculate disjunctive partitions. Using
this, we can take advantage of the superior efficiency of disjunctive partitioning.
However, if the sizes of the disjunctive partitions are larger than the
corresponding conjunctive partitions it is not certain that we have gained
anything. In this section we examine the answer to this question. First let's look
at $or\_R_x(v,v')$. By definition,
$or\_R_x(v,v') = por\_R_x(pc,x,y,pc',x') \wedge (y = y')$. It is possible to
build an example in which $|or\_R_x(v,v')| = O(n \cdot |por\_R_x(pc,x,y,pc',x')|)$,
where $n$ is the number of state variables. An example is the assignment
$x \leftarrow y$, where $x$ is the first variable in the BDD order after pc and
$y$ is the last state variable in the BDD order.

In order to avoid this factor, we do not calculate $or\_R_x(v,v')$. We calculate
only $por\_R_x(pc,x,y,x',pc')$ and rewrite the procedures that calculate image and
pre_image operations in such a way as to use $por\_R_x(pc,x,y,x',pc')$ instead of
$or\_R_x(v,v')$. In the next subsection, we present the new algorithm for image and
pre_image computation and prove its correctness. After that, we will bound the
size of $por\_R_x(pc,x,y,x',pc')$.

5.1 Image and Pre_Image Computations Using Partial Disjunctive
Partitions

When computing image (pre_image) using disjunctive partitions, it is possible to
calculate the image (pre_image) on each disjunctive partition independently and
then union the results. In this subsection, we show how to compute image
or pre_image when only $por\_R_x(pc,x,y,pc',x')$ is given for each variable $x$.

Lemma 1. $pre\_image(S(pc',x',y'), or\_R_x(pc,x,y,pc',x',y')) =$

   $pre\_image(S(pc',x',y), por\_R_x(pc,x,y,pc',x'))$

From this lemma, we get a simple algorithm that in the first step unprimes $y'$
in $S(pc',x',y')$ (linear in the size of the BDD), and then performs the ordinary
pre_image algorithm on the result. The proof of this lemma is given in the full
version of this paper.

Lemma 2. $image(S(pc,x,y), or\_R_x(pc,x,y,pc',x',y')) =$

   $image(S(pc,x,y'), por\_R_x(pc,x,y',pc',x'))$

Here again, we have a simple algorithm. First prime $y$ in $S(pc,x,y)$ and in
$por\_R_x(pc,x,y,pc',x')$ and then calculate the image using the results. The proof
is almost the same as that of the previous lemma.
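
Both lemmas can be checked on an explicitly enumerated toy partition. The sketch
below is an illustration under stated assumptions: relations are sets of tuples
rather than BDDs, the single program line (pc = 0: x = 1) is hypothetical, and
priming or unpriming $y$ is modelled simply by reusing the same tuple position.

```python
from itertools import product

VALUES = (0, 1)

# Hypothetical partial disjunctive partition of x for the line  pc=0: x = 1,
# stored as tuples (pc, x, y, pc', x'); note that y' does not appear.
por_R_x = {(0, x, y, 1, 1) for x, y in product(VALUES, VALUES)}

# Ordinary disjunctive partition: the same tuples extended with y' = y.
or_R_x = {(pc, x, y, pc1, x1, y) for (pc, x, y, pc1, x1) in por_R_x}

def image_full(S):
    """image(S, or_R_x): conjoin, then quantify out pc, x, y."""
    return {(pc1, x1, y1) for (pc, x, y, pc1, x1, y1) in or_R_x if (pc, x, y) in S}

def image_partial(S):
    """Lemma 2: read y as y' in both S and por_R_x, then quantify out pc, x."""
    return {(pc1, x1, y) for (pc, x, y, pc1, x1) in por_R_x if (pc, x, y) in S}

def pre_image_full(T):
    """pre_image(T, or_R_x): conjoin, then quantify out pc', x', y'."""
    return {(pc, x, y) for (pc, x, y, pc1, x1, y1) in or_R_x if (pc1, x1, y1) in T}

def pre_image_partial(T):
    """Lemma 1: unprime y' in T, then use por_R_x only."""
    return {(pc, x, y) for (pc, x, y, pc1, x1) in por_R_x if (pc1, x1, y) in T}

S = {(0, 0, 1), (0, 1, 0), (2, 1, 1)}
T = {(1, 1, 0), (1, 0, 1)}
print(image_full(S) == image_partial(S),
      pre_image_full(T) == pre_image_partial(T))   # True True
```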

5.2 Bounding the Size of the Partial Disjunctive Partitions 

In this subsection, we bound the size of partial disjunctive partitions. The proofs 
of these claims are long, technical, and tedious. Proof sketches are given in the 
full version of this paper. Despite the relatively large upper bound, in practice, 
these extreme examples are rare. See Section 7 for experimental results. 

Since every variable is dependent on pc, it seems wise to place pc as the first
state variable in the BDD ordering. All of the following lemmas assume that the
BDD ordering follows this idea.

We define $por\_R_x(v,x')$ to be $por\_R_x(v,x',pc')$ without the condition on the
value of $pc'$:

   $por\_R_x(v,x') = (\exists x\,(dep\_states_x(v))) \wedge and\_R_x(v,x')$.

We can now rewrite the definition of $por\_R_x(v,x',pc')$ using $por\_R_x(v,x')$:

   $por\_R_x(v,x',pc') = por\_R_x(v,x') \wedge and\_R_{pc}(v,pc')$.

The following lemmas will first bound the size of $por\_R_x(v,x')$ and only then
the size of $por\_R_x(v,x',pc')$.



6 Scalability for Distributed Model Checking 

We now turn to the scalability of disjunctive partitioning. We claim that
symbolic model checking with disjunctive partitioning is not only more efficient
than with conjunctive partitioning, it also scales better. This is a direct result
of the fact that quantification distributes over disjunctive partitioning, but not
over conjunctive partitioning. Since
$image(S(v)) = \bigvee_x \exists v\,(S(v) \wedge or\_R_x(v,v'))$, when
using disjunctive partitions or partial disjunctive partitions we can calculate the
image using one partition on each processor, including quantification, and then
union the results of all processors. Because the image computation may be
exponential in the number of BDD nodes and the union operation is linear in the
number of BDD nodes, distributing the partitions between n processors divides
the "heavy" work by n. Note that when image computation is done distributively
using conjunctive partitions it requires another step in which the partial results
are "anded" together before quantification. Thus, the work done after all the
processors have calculated their results may still be exponential in the number
of BDD nodes.

We now suggest two distributed algorithms for disjunctive partitions. The first
algorithm is simple and uses a master and several slaves. The master will send
$S(v)$ to all the slaves and start sending each idle slave a disjunctive partition.
Each slave that gets a disjunctive partition will perform the image computation
with this partition and union it with previous computations it made. When there
are no more partitions and all slaves are idle, the master will gather all the
slaves' results and union them. Reachability computation is then performed by
repeated image computations of the former algorithm. One drawback with this
scheme is that while the master computes the union of all the slaves' results, the
slaves are idle.
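
A minimal sketch of the master and slaves scheme follows, with a thread pool
standing in for the slave processes. It is a simplification under stated
assumptions: sets replace BDDs, each partition is handed to the pool as one task
(the per-slave accumulation of earlier results is not modelled), and the
modulo-6 counter model is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def image(states, partition):
    """Successors of `states` under one disjunctive partition."""
    return {t for (s, t) in partition if s in states}

def distributed_image(states, partitions, workers=2):
    """One image per partition on a worker pool; the master unions the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_images = pool.map(lambda p: image(states, p), partitions)
        return set().union(*partial_images)

def reachable(init, partitions):
    """Reachability by repeated distributed image computations."""
    reached, frontier = set(init), set(init)
    while frontier:
        frontier = distributed_image(frontier, partitions) - reached
        reached |= frontier
    return reached

# Hypothetical model: a counter modulo 6 split into two disjunctive partitions.
partitions = [
    {(n, (n + 2) % 6) for n in range(6)},
    {(n, (n + 3) % 6) for n in range(6)},
]
print(sorted(reachable({0}, partitions)))   # [0, 1, 2, 3, 4, 5]
```

The final union in `distributed_image` is the step during which, in the scheme
described above, the slaves have nothing to do.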

The second algorithm avoids this problem. In this algorithm, each process $P_i$
is responsible for several partitions $TR_i$, and has its own reachability set
$RS_i$. There is also a (shared) queue of sets of states and each process has two
pointers to this queue: a shared pointer for entering sets to the queue and a
private one for reading from the queue. As a result, all processors read all the
sets that enter the queue. At the beginning the queue has the initial set of
states. Each process $P_i$, at each iteration, takes the next set $S$ from the
queue (according to its pointer), removes from it the parts it already handled,
$S = S \setminus RS_i$, and adds the result to $RS_i$, then calculates the image
of $S$ using $TR_i$, getting $image_i = image(S, TR_i)$. In order to continue only
with the new states, the reachable states are removed from $image_i$, getting
$new_i = image_i \setminus RS_i$. In the case where $new_i \neq \emptyset$, it is
put in the next entry of the queue. When all processors are trying to read from
the queue and they are all pointing to an empty slot in the queue, the algorithm
has ended. At the end, each process has the whole reachability set because it saw
all the image computation results of all processes in the queue and no new set of
states is entered to the queue.

                                  Conjunctive partitions       Disjunctive partitions
    Example             Num of    Reachability  Maximal        Reachability  Maximal
                        vars      time          step time      time          step time
    simple              505       11024 s       95.7 s         23.5 s        0.54 s
    factorial           159       31.8 s        0.9 s          0.11 s        0.01 s
    insert sort         197       264.6         3.9 s          15.23 s       0.08 s
    quick sort          282       10197 s       10 s           172 s         0.8 s
    merge sort          654       952 s         7.77 s         0.62 s        0.04 s
    pointer quick sort  693       1546 s        5.8 s          57 s          1.8 s
    pointer merge sort  716       > 8 h         > 99 s         78 s          0.25 s

Fig. 4. Comparison of reachability computation using conjunctive partitions against
using partial disjunctive partitions.
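
The queue-based scheme can be simulated sequentially to see why every process
ends up with the whole reachability set. This is a sketch only: processes are
visited round-robin instead of running in parallel, sets replace BDDs, and the
function name `queue_reachability` and the modulo-6 counter model are
assumptions made for the example.

```python
def image(states, partition):
    """Successors of `states` under one disjunctive partition."""
    return {t for (s, t) in partition if s in states}

def queue_reachability(init, partitions_per_process):
    """Round-robin simulation of the shared-queue scheme (no real parallelism)."""
    n = len(partitions_per_process)
    queue = [set(init)]              # shared, append-only queue of state sets
    read_ptr = [0] * n               # private read pointer of each process
    RS = [set() for _ in range(n)]   # private reachability set of each process
    while any(ptr < len(queue) for ptr in read_ptr):
        for i in range(n):
            if read_ptr[i] == len(queue):
                continue                         # process i is waiting
            S = queue[read_ptr[i]] - RS[i]       # drop what was already handled
            read_ptr[i] += 1
            RS[i] |= S
            image_i = set()
            for partition in partitions_per_process[i]:
                image_i |= image(S, partition)
            new_i = image_i - RS[i]
            if new_i:
                queue.append(new_i)              # publish only the new states
    return RS

# Hypothetical model: a counter modulo 6; each process owns one partition.
p0 = {(n, (n + 2) % 6) for n in range(6)}
p1 = {(n, (n + 3) % 6) for n in range(6)}
result = queue_reachability({0}, [[p0], [p1]])
print([sorted(rs) for rs in result])
```

The loop ends when every read pointer has reached the end of the queue, which
corresponds to all processes pointing to an empty slot.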

7 Experimental Results 

We implemented the translation from conjunctive partitioned transition relation 
to partial disjunctive partitioned transition relation in the IBM model checker 
RuleBase [1]. We compared reachability analysis using conjunctive partitions 
with reachability analysis using partial disjunctive partitions on models that 
were translated from software programs. These software programs were written 
in C and contain pointers and arrays. In both cases, we applied dynamic BDD 
reordering. In order to obtain a fair comparison between these algorithms, we 
ran each one twice. In the first run, the algorithm reordered the BDD with no 
time limit in order to find a good BDD order. The initial order of the second 
run was the BDD order found by the first run. The partial disjunctive parti- 
tioning outperforms the conjunctive partitioning with respect to execution time, 
as shown in Figure 4. We compared the sizes of partial disjunctive partitions 
with those of conjunctive partitions under the same BDD order. The table in
Figure 5 shows the maximal and minimal ratios between the size of a variable's
partial disjunctive partition and the size of its conjunctive partition. We
specifically note the ratio of the pc variable and the size of its partial
disjunctive partition.
In addition, we show the maximal conjunctive partition and maximal partial 
disjunctive partition not including pc. We observed that the partial disjunctive 
partitions were in the same order of magnitude or even smaller than the con- 
junctive partitions. This was achieved by the use of partial disjunctive partitions 
instead of ordinary disjunctive partitions. In our experiments we found that the
size of each ordinary disjunctive partition ($or\_R_x(v,v')$) was up to 84 times the
size of its corresponding partial disjunctive partition.

                                Relations between partitions size    Partitions size
    Examples            # Vars  Min        Max        pc             Disj    Max     Max
                                disj/conj  disj/conj  disj/conj      pc      conj    disj
    simple              505     0.65       1.54       1.00           8101    10788   10777
    factorial           159     0.53       1.27       1.00           3562    1447    1433
    insert sort         197     0.58       1.38       0.98           3201    360     390
    quick sort          282     0.53       1.34       1.00           12595   1422    1124
    merge sort          654     0.47       1.49       1.00           7925    8346    8341
    pointer quick sort  693     0.46       1.71       1.00           17650   62225   52155
    pointer merge sort  716     0.35       1.30       0.99           6861    66987   32145

Fig. 5. Comparison between size of conjunctive partitions and partial disjunctive
partitions.

8 Conclusions and Future Work 

Using partial disjunctive partitions seems to be a successful and natural scheme 
for software models. In this work, we show how to apply disjunctive partitioning 
to software models while keeping the partitions small. We also show how to 
enhance the image and pre-image computation to support our partial disjunctive 
partitions and make model checking algorithms more efficient. However, this is 
only the beginning and there are a number of directions for future work. As 
we note above, we handle variables with a large number of bits by creating a 
single partition for each variable containing the behaviors of all its bits. Future 
work will explore the possibility of implementing the DNF partitioned transition 
relation [4], where the disjunctive partition of a state variable is composed of 
conjunctive partitions of its bits. 

As we claimed in Section 6, disjunctive partitioned transition relation is nat- 
ural for distributed algorithms. It seems wise to implement and explore both 
algorithms presented in that section. Special attention should be given to find- 
ing a good distribution of the disjunctive partitions over the processes in order 
to achieve good load balancing. 

Acknowledgments. We thank Cindy Eisner, Yoad Lustig and Ziv Nevo for

Abstract. In the VAMP (verified architecture microprocessor) project we have
designed, functionally verified, and synthesized a processor with full DLX instruc- 
tion set, delayed branch, Tomasulo scheduler, maskable nested precise interrupts, 
pipelined fully IEEE compatible dual precision floating point unit with variable 
latency, and separate instruction and data caches. The verification has been carried 
out in the theorem proving system PVS. The processor has been implemented on 
a Xilinx FPGA. 



1 Introduction 

Previous Work. Work on the formal verification of processors so far has concentrated 
mainly on the following aspects of architectures: 

i) Processors with in-order scheduling, one or several pipelines including forwarding, 
stalling and interrupt mechanisms [3,13,28] . The verification of the very simple, non- 
pipelined FM9001 processor has been reported in [2]. Using the flushing method 
from [3] and uninterpreted functions for modeling execution units, superscalar pro- 
cessors with multicycle execution units, exceptions and branch prediction [28] have 
been verified by automatic BDD based methods. Also, one can transform specifica- 
tion machines into simple pipelines (with forwarding and stalling mechanism) by 
an automatic transformation, and automatically generate formal correctness proofs 
for this transformation [15]. 

ii) Tomasulo schedulers with reorder buffers for the support of precise interrupts [5,8, 
16,24]. Exploiting symmetries, McMillan [16] has shown the correctness of a pow- 
erful Tomasulo scheduler with a remarkable degree of automation. Using theorem 
proving, Sawada and Hunt [24] show the correctness of an entire out-of-order pro- 
cessor, precise interrupts, and a store buffer for the memory unit. They also consider 
self-modifying code (by means of a sync instruction). 

* The work reported here was done while the author was with Saarland University. 

** Research supported by the DFG graduate program ‘Effizienz und Komplexität von Algorithmen

iii) Floating point units (FPU). The correctness of an important collection of floating
point algorithms is shown in [21,22] using the theorem prover ACL2. Correctness
proofs using a combination of theorem proving and model checking techniques for
the FPUs of Pentium processors are claimed in [4,19]. As the verified unit is part of
an industrial product, not all details have been published. Based on the constructions
and on the paper and pencil proofs in [18], a fully IEEE compatible FPU has been
verified [1,11] (using mostly but not exclusively theorem proving).

iv) Caches. Multiple cache coherence protocols have been formally verified, e.g., [6,17, 
25,26]. Paper and pencil proofs are extremely error prone, and hence the generation 
of proofs for interactive theorem proving systems is slow. The method of choice is 
model checking. The compositional techniques employed by McMillan [17] even 
allow for the verification of parameterized designs, i.e., cache coherence is shown 
for an arbitrary number of processors. 

Simplifications, Abstractions, and Restrictions. Except for the work on floating point
units, the cache coherence protocol in [6], and the FM9001 processor [2], none of the
papers quoted above states that the verified design has actually been implemented. All
results cited above except [1,2,6,11] use several simplifications and abstractions:

i) The realized instruction set is restricted: always included are the six instructions
considered in [3]: load word, store word, jump, branch equal zero, three register ALU
operations, ALU immediate operations. Five typical extra instructions are trap, return 
from exception, move to and from special registers, and sync [24]. The branch equal 
zero instruction is generalized in [28] by an uninterpreted test evaluation function. 
Most notably the verification of machines with load/store operations on half words 
and bytes has apparently not been reported. In [27] the authors report an attempt 
to handle these instructions by automatic methods which was unsuccessful due to 
memory overflow. 

ii) Delayed branch is replaced by non-deterministic speculation (speculating branch 
taken/not taken). 

iii) Sometimes, non-implementable constructs are used in the verification of the
processors: e.g., Hosabettu et al. [8] use tags from an infinite set. Obviously, this is not
directly implementable in real hardware.

iv) The verification of the FPUs covers neither the handling of denormal numbers
nor that of exception flags. The verification of a dual precision FPU has not been reported
(though, obviously, Intel’s and AMD’s FPUs are capable of dual precision).

v) No verification of a memory unit with caches has been reported. Eiriksson [6] only 
reports the verification of a bit-level implementation of a cache coherence protocol 
without data consistency. 

vi) The verification of pipelines or Tomasulo schedulers with instantiated floating point 
units and memory units with caches and main memory bus protocol has not been 
reported. Indeed, in [27] the authors state: “An area of future work will be to prove 
that the correctness of an abstract term-level model implies the correctness of the 
original bit-level design.” 

Results and Overview. In the VAMP (verified architecture microprocessor) project we
have designed, functionally verified, and synthesized a processor with the full DLX
instruction set, delayed branch, a Tomasulo scheduler, maskable nested precise interrupts,
a pipelined, fully IEEE 754 [9] compatible dual precision floating point unit with variable
latency, as well as separate, coherent instruction and data caches. We use only finite
tags in the hardware. Thus all abstractions, restrictions and simplifications mentioned
above have been removed. Specification and verification were performed using the
interactive theorem proving system PVS [20]. All formal specifications and proofs are
on our web site.¹ The hardware description was automatically extracted from PVS and
translated into Verilog HDL by a tool sketched in section 7. Hardware with non-verified
rudimentary software is up and running on a Xilinx FPGA. The Verilog design can also
be downloaded from our web site.

In section 2, we summarize the fixed point instruction set, its floating point extension, 
and the interrupt support realized. We give a micro-architectural overview with a focus 
on the memory system. Section 3 describes the correctness criterion, the main proof 
strategy, and the integration of the execution units into the Tomasulo core. Correctness 
criterion and proof strategy are based on scheduling functions [14,18] (similar to the 
st^-component of MAETTs [23]). The model of the execution unit is in a nontrivial way 
more general than previous models without complicating interactive proofs too much. 

Section 4 presents a delayed branch mechanism, which is automatically constructed 
and proven correct by the methods for automatic pipeline construction from [15] and 
summarizes the specification of an interrupt mechanism for maskable nested precise 
interrupts and delayed PC from [18]. Section 5 deals with the integration of the floating 
point unit from [11] into our Tomasulo scheduler. Section 6 deals with loads and stores 
of double words, words, half words, and bytes at a 64 bit cache/memory interface. We 
also sketch correctness proofs of the implementation of a simple coherence protocol 
between data cache and instruction cache, as well as the implementation of a main 
memory bus protocol. Section 7 describes the implementation of the VAMP on a Xilinx
FPGA. Section 8 gives an overview of the verification effort for various parts of the
project, summarizes our work, and sketches directions of some future work. 



2 Overview of the VAMP Processor 

Instruction Set. The full DLX instruction set from [7] is realized. This includes loads 
and stores for double words, words, half words, and bytes, various shift operations, and 
two jump-and-link operations. Loads of bytes and half words can be unsigned or signed. 
In order to support the pipelining of instruction fetches, delayed branch with one delay 
slot is used. Note that delayed branch changes the sequential semantics of program 
execution. 

The floating point extension of the DLX instruction set from [18] is supported. The
user sees a floating point register file with 32 registers of single precision numbers as well
as a single floating point condition code register FCC. Pairs of floating point registers can
be accessed as registers for double precision numbers (with an even register address).
Supported operations are: i) loads and stores for singles and doubles, ii) +, −, ×, ÷ both
for single and double precision numbers, iii) test-and-set, the result is stored in FCC,
iv) conditional branches as a function of FCC, v) conversions between singles, doubles
and integers, vi) moves between the general purpose register file and the floating point
register file. Operations are fully IEEE compatible [9]. In particular, all four rounding
modes, denormal numbers, and exponent wrapping as a function of the interrupt masks
are realized.

Fig. 1. Main data paths of the VAMP processor

¹ http://www-wjp.cs.uni-sb.de/forschung/projekteWAMP/

Interrupt Support. Presently, the interrupts from table 1 in section 4 are supported.
Interrupts are maskable and precise. Floating point interrupts are accumulated in 5 bits of
a special purpose register IEEEf (IEEE flags) as prescribed by the IEEE standard. All
special purpose registers (details in section 4) are collected into a special purpose
register file. Operations supporting the interrupt mechanism are: i) moves between general
purpose registers and special purpose registers, ii) trap, iii) return-from-exception.

Microarchitecture Overview. Figure 1 gives a high level overview of the VAMP
microarchitecture. Stages IF and ID are a pipelined implementation of delayed branch
as explained in section 4. Stages EX, C and WB realize a Tomasulo scheduler with 5
execution units, a fair scheduling policy on the common data bus CDB, and a reorder
buffer ROB (for precise interrupts). The execution units are i) MEM: memory unit with
variable latency and internal pipelining (there is presently no store buffer), ii) XPU: the
fixed point unit, iii) FPU1 to FPU3: specialized pipelined floating point units with
variable latency. FPU1 performs additions and subtractions. FPU2 performs multiplications
and divisions. FPU3 performs test-and-set as well as conversions. The data output of the
reorder buffer is 64 bits wide. The floating point register file FPR is physically realized
as 16 registers, each 64 bits wide. The general purpose register file GPR and the special
purpose register file SPR are both 32 bits wide, and have 32 and 9 entries, respectively.
They are connected to the low-order bits of the ROB output.
Fig. 2. Data paths of the VAMP memory unit

Figure 2 depicts a simplified view of the memory unit. Internally, it has two pipeline
stages. The first stage does address and control signal computations. The second stage
performs the actual data cache access via signals adr, din, and dout. Instructions are
fetched from the instruction cache via signals pc and inst. The memory interface Mif
internally consists of a data cache, an instruction cache, and a main memory. The caches
are kept coherent (this does not suffice to guarantee correct execution of self-modifying
code). Details are explained in section 6.

3 Correctness Criterion and Tomasulo Algorithm 

Notations. We consider a specification machine S and an implementation machine I.
Configurations of these machines are tuples, whose components $R_S$ and $R_I$, respectively,
are registers or memories. Register contents are bit strings. Memory contents are modeled
as mappings from addresses (bit strings) to bit strings. For example, $PC_S$ denotes the
program counter of the specification machine, and $mem_I$ denotes the main memory of
the implementation machine.

The specification machine processes a sequence of instructions $I_0, I_1, \ldots$ at the
rate of one instruction per step. We denote by $R_S^i$ the content of component R before
execution of instruction $I_i$. One step of the implementation machine is a hardware cycle,
and we denote by $R_I^T$ the content of component R during cycle T. The fetch of the 4
bytes of an instruction into the instruction register IR of the implementation machine
during cycle T can be specified by $IR_I^{T+1} := mem_I^T[PC_I^T + 3 : PC_I^T]$.

Although the instruction register is not a visible register, one can specify the desired
content $IR_S^i$ of the instruction register for the specification machine for instruction $I_i$ as
a function of the visible components by $IR_S^i = mem_S^i[PC_S^i + 3 : PC_S^i]$. Defining the
next configuration of the specification machine involves many such intermediate
definitions, e.g., the immediate constant $imm_S^i$, the effective address $ea_S^i$, etc. Starting
from the visible components we extend the configuration of the specification machine
in this way by numerous (redundant) secondary components.

Scheduling Functions. For hardware cycles T and pipeline stages k of the implementation
machine, we formally define an integer valued scheduling function $sI(k,T)$ [14],
where $sI(k,T) = i$ has the intended meaning that instruction $I_i$ is in stage k during
cycle T.

By treating instruction numbers like integer valued tags,² the definition of these
functions is straightforward. We initialize $sI(k,0) := 0$ for all stages. We then “clock”
these tags through the pipeline stages under the control of the update enable signals³
$ue_k$ for the output registers of stage k. If a stage is not clocked, the scheduling function
is not changed, i.e., $sI(k,T) := sI(k,T-1)$ if $\neg ue_k^{T-1}$. Note that we introduce
separate “stages” k for each reservation station and ROB entry.

For the fetch stage,⁴ e.g., we define $sI(fetch,T) := sI(fetch,T-1) + 1$ if
$ue_{fetch}^{T-1}$, meaning that the content of the fetch stage progresses by one instruction in the
instruction stream $I_0, I_1, \ldots$ If stage k receives data from stage k' in cycle T, we define
$sI(k,T) := sI(k',T-1)$. Note that this covers the case that a stage can receive data
from two different stages k' and k'', since in a fixed cycle T, it receives data from only
one of these stages. This occurs at the ROB, e.g., where we allow bypassing branch
instructions from the instruction register directly into the ROB without going through an
execution unit. Thus, the ROB can receive data from the CDB and from the instruction
register.

As a form of bookkeeping for the memory unit, we introduce an additional “stage”
mem'. The corresponding scheduling function $sI(mem',T)$ equals $sI(mem,T)$ if the
memory unit is empty or the instruction in the unit has not accessed the main memory
yet. Otherwise, we set $sI(mem',T) := sI(mem,T) + 1$. We need this bookkeeping
function in order to model whether the memory is already updated by a store instruction.
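
The clocking of tags described above can be sketched in a few lines of Python. This is a minimal illustration, not the PVS definition: it assumes a purely linear pipeline in which every stage receives data only from its predecessor, and the names `scheduling_functions`, `stages`, and `ue` are hypothetical.

```python
def scheduling_functions(stages, ue, cycles):
    """Compute sI(k, T) for a linear pipeline (illustrative assumption).

    stages: stage names in pipeline order; stages[0] is the fetch stage.
    ue[k][T]: update enable of the output registers of stage k in cycle T.
    Returns a dict mapping each stage k to [sI(k, 0), ..., sI(k, cycles)].
    """
    sI = {k: [0] for k in stages}              # sI(k, 0) := 0 for all stages
    for T in range(1, cycles + 1):
        for pos, k in enumerate(stages):
            if not ue[k][T - 1]:
                nxt = sI[k][T - 1]             # stage not clocked: unchanged
            elif pos == 0:
                nxt = sI[k][T - 1] + 1         # fetch advances in the stream
            else:
                prev = stages[pos - 1]
                nxt = sI[prev][T - 1]          # tag is clocked in from k'
            sI[k].append(nxt)
    return sI


if __name__ == "__main__":
    names = ["fetch", "decode", "exec"]
    enables = {k: [True, True, True] for k in names}
    print(scheduling_functions(names, enables, 3))
```

All updates read values of cycle T-1 only, so the order in which the stages are visited does not matter. Reservation stations, ROB entries, and the extra stage mem' would each need their own entry and their own source stage in a fuller model.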

Correctness Criterion. We are interested in the content of the main memory mem and
the register files $RF \in \{GPR, FPR, SPR\}$ after certain instructions $I_i$, respectively
before instruction $I_{i+1}$. The main memory is an output “register” of stage mem and
the register files are output “registers” of stage wb. The functional correctness criterion
requires an instruction $I_i$ in stage mem' of the implementation machine I to see the
same memory content as the corresponding instruction of the specification machine S;
formally $mem_I^T = mem_S^{sI(mem',T)}$. The corresponding condition for register files RF
is $RF_I^T = RF_S^{sI(wb,T)}$. In general, we prove by induction on T for all stages k and
all output registers R of stage k that $R_I^T = R_S^{sI(k,T)}$, where $R_S$ can be a visible or
redundant component of the configuration of the specification machine. Note that for
technical reasons, we claim for the instruction register that $IR_I^T = IR_S^{sI(fetch,T)-1}$.

² Having integer valued tags is only a proof trick. In hardware, we only use finite tags. During
the proof of correctness for the Tomasulo scheduler, we prove that these finite tags properly
match the infinite instruction number.

³ Update enable signals are sometimes called ‘register activates’. They are used to (de-)activate
updating of register contents.

⁴ We introduce symbolic names for some stages k, e.g., fetch and mem.

The liveness criterion states that all instructions that are not interrupted reach the 
writeback stage. At the time of submission of this paper, we have separate formal liveness 
proofs for the scheduler and the execution units; we are currently working on combining 
them into a single formal liveness proof for the entire machine. 

Paper and pencil proofs for the correctness of Tomasulo schedulers tend to follow
a canonical pattern: i) For instructions $I_i$ and register operand R, one defines $last(i,R)$
as the index of the last instruction before $I_i$ which wrote register R. ii) One shows by
induction that the formal definitions of tags and valid bits have the intended meaning. In
our setting, this means that the finite tags in hardware correspond to the integer valued
tags provided by the scheduling function sI. iii) Finally, one has to show that the
reservation station of instruction $I_i$ reconstructs the value of R written by instruction
$I_{last(i,R)}$. The rest is easy.

It is important to observe that the structure of these paper and pencil proofs and their
formal (theorem proving) counterparts does not depend much on the fixed or variable
latency of execution units or on whether these units are pipelined. The scheduler recognizes
instructions completed by the execution units simply by examining the tags returned from
the units. The situation is very different for model checking [28].

Integration of Execution Units. The proofs for the scheduler and the proofs for the
execution units are separated by the following specifications for the execution units [11,
10]. Notations refer to figure 3.

i) $stall_{in}^T \Rightarrow \neg valid_{out}^T$, i.e., if the scheduler asserts $stall_{in}$, the execution unit does
not return a valid instruction.

ii) $\forall T\, \exists T' > T : \neg stall_{out}^{T'}$, i.e., the $stall_{out}$ signal is never active indefinitely.

iii) Instructions dispatched with $tag_{in}^T = tg$ at time T will eventually (at time T' > T)
return a result with the same tag, i.e., $tag_{out}^{T'} = tg$. Moreover, $data_{out}^{T'} = f(data_{in}^T)$
where f is the (combinatorial) function the execution unit is supposed to compute.

iv) For each time T at which a result with tag tg is returned, there is an earlier time
T' < T such that an instruction with tag tg was dispatched at time T', and tag tg was
not returned between T' and T. Hence, the execution units do not create spurious
outputs.
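
The interface conditions above lend themselves to a simple check on recorded signal traces. The following Python sketch is only an illustration under stated assumptions: the signal names (`stall_in`, `valid_out`, `tag_out`, `dispatch`, `tag_in`) are hypothetical, condition ii) is not checked, and the liveness condition iii) is replaced by a bounded stand-in that requires every dispatched tag to be returned before the finite trace ends.

```python
def cycle(stall_in=False, valid_out=False, tag_out=None,
          dispatch=False, tag_in=None):
    """One cycle of observed interface signals (names are hypothetical)."""
    return dict(stall_in=stall_in, valid_out=valid_out, tag_out=tag_out,
                dispatch=dispatch, tag_in=tag_in)


def check_trace(trace):
    """Return a list of (condition, detail) pairs for violated conditions."""
    violations = []
    in_flight = set()                  # tags dispatched but not yet returned
    for T, c in enumerate(trace):
        # i) a stalled unit must not return a valid instruction
        if c["stall_in"] and c["valid_out"]:
            violations.append(("i", T))
        # iv) a returned tag must have been dispatched at some T' < T
        #     and must not have been returned since
        if c["valid_out"]:
            if c["tag_out"] in in_flight:
                in_flight.remove(c["tag_out"])
            else:
                violations.append(("iv", T))
        # a dispatch in cycle T only becomes visible to later cycles
        if c["dispatch"]:
            in_flight.add(c["tag_in"])
    # bounded stand-in for iii): nothing may remain in flight at the end
    violations.extend(("iii", tag) for tag in sorted(in_flight))
    return violations


if __name__ == "__main__":
    trace = [cycle(dispatch=True, tag_in=1), cycle(dispatch=True, tag_in=2),
             cycle(valid_out=True, tag_out=2), cycle(valid_out=True, tag_out=1)]
    print(check_trace(trace))
```

Note that the example trace returns tag 2 before tag 1, which the specification permits: only the tags, not the order, tie results to instructions.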

Note that the instructions do not need to leave the execution units in the order they enter 
the units; all FPUs, e.g., exploit this by allowing instructions on some special operands to 
overtake other instructions. Moreover, multiplications may overtake divisions (cf. [10] 
for details). 

The four conditions above must be shown for each of the execution units provided
the scheduler guarantees the following three conditions: i) No instruction is dispatched
to an execution unit which sends a stall_out signal to its reservation station, ii) The
execution units are not stalled forever by the producers, iii) Tag-uniqueness: no tag
which is dispatched into an execution unit is already in use.

Fig. 3. Model of an execution unit

4 Delayed Branch and Maskable Nested Precise Interrupts 

In the delayed branch mechanism, taken branches yield a new PC of the form PC +
imm + 4, taken branches are delayed, and PC + 8 is saved to the register file during
jump-and-link. In the equivalent delayed PC mechanism [14,18], one uses an intermediate
program counter PC' with branch targets PC' + imm, all fetches use a delayed program
counter DPC, and PC' + 4 is saved during jump-and-link.
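
The delayed PC mechanism can be illustrated with a small Python sketch of the two program counters. It is a simplified model under assumptions that are not taken from the paper: the reset values (DPC = start, PC' = start + 4), the toy program encoding, and the name `delayed_pc_trace` are all hypothetical, and every branch in the toy program is taken.

```python
def delayed_pc_trace(program, steps, start=0):
    """Return the sequence of fetch addresses under the delayed PC mechanism.

    program maps an address to ("br", imm) for a taken branch; all other
    addresses are treated as non-branch instructions (toy encoding).
    """
    dpc, pcp = start, start + 4        # DPC and PC' (assumed reset values)
    fetched = []
    for _ in range(steps):
        fetched.append(dpc)            # all fetches use the delayed PC
        instr = program.get(dpc, ("op",))
        if instr[0] == "br":
            nxt = pcp + instr[1]       # branch target PC' + imm
        else:
            nxt = pcp + 4
        dpc, pcp = pcp, nxt            # DPC follows PC' one step later
    return fetched


if __name__ == "__main__":
    # branch at address 8 with imm = 16
    print(delayed_pc_trace({8: ("br", 16)}, 6))
```

For a branch at address 8 with imm = 16 the fetch sequence is 0, 4, 8, 12, 28, 32: the instruction at 12 is the delay slot, and the target 28 agrees with PC + imm + 4 from the delayed branch formulation.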

Figure 4 depicts a pipelined implementation of the delayed PC mechanism in the 
VAMP processor. This construction and its formal correctness proof are automatically 
obtained by the method for automatic pipeline construction from [15]. Indeed, fetching
instructions from the intermediate program counter PC' is, not only intuitively but
formally, forwarding of DPC. The role of the multiplexers above PC' and DPC is
explained in the following paragraphs about interrupts.

The formal specification of the interrupt mechanism for delayed PC is based on the
definitions of [18, Chap. 5, 9.1]. Table 1 shows the supported interrupts.⁵ The special
purpose registers for the interrupt mechanism are: i) status register SR for interrupt
masks, ii) two registers ECA for exception cause and EData for parameters passed
to the interrupt service routine, iii) two registers EPC and EDPC for return addresses
for PC' and DPC, and iv) a register IEEEf for the accumulation of masked floating
point exceptions.

At issue time of an instruction $I_i$, it is unknown whether $I_i$ will be interrupted and
whether the interrupt requires the interrupted instruction to be repeated or not. Therefore,
we have to save two pairs of potential return addresses in the reorder buffer:
$({PC'}_S^{i}, DPC_S^{i})$ for interrupts of type “repeat”, and the results of the uninterrupted
next PC' and next DPC computations $({PC'}_S^{i+1}, DPC_S^{i+1})$ for interrupts of type
“continue”. The data paths of the PC environment are shown in figure 4.

Fig. 4. VAMP PC Environment

Interrupt handling in the specification machine S depends on the components ECA
and EData. In the implementation, these two registers are treated as additional results of
the execution units; thus, execution units have up to four 32-bit results. This affects the
width of the ROB. The formal correctness of these components in the ROB at writeback
time is asserted without additional verification effort by the consistency of the Tomasulo
scheduler. Further lemmas are needed for the correctness of the PCs stored in the ROB.
The return-from-exception instruction is treated like any other instruction; no special
effort is needed here.

⁵ Page fault signals are presently tied to zero.
Table 1. Implemented interrupts

index  name                   maskable  type
0      reset                  no        abort
1      illegal instruction    no        repeat
2      misalignment           no        repeat
3      page fault on fetch    no        repeat
4      page fault load store  no        repeat
5      trap                   no        continue
6      arithmetic overflow    yes       continue
7      FPU overflow           yes       continue
8      FPU underflow          yes       continue
9      FPU loss of accuracy   yes       continue
10     FPU division by zero   yes       continue
11     FPU invalid            yes       continue
12     FPU unimplemented      no        continue

Since the main memory is updated before writeback of an instruction, one has to
guarantee that in case of an interrupt, all stores prior to the interrupted instruction are
executed, but none of the instructions after it. In particular, one has to show that a store
that has reached the writeback stage has also accessed the main memory, i.e., it did not
enter the wrong execution unit.

5 Floating Point Unit 

Execution Units. The FPUs and their verification are described in [11]. The construction
and verification of the combinatorial circuits is based on the paper and pencil proofs
from [18]. The internal control of the iterative unit for multiplication and division is
complex: during cycles in which the division unit performs a subtraction step, the multiplier
can be used by multiplication operations or by multiplication steps of other division
operations. Moreover, operations with special operands are processed in a single cycle.
Thus in general, the units do not process instructions in order, but that is not required by
the specifications from section 3. We remark that we have formal proofs but no paper and
pencil proofs for the correctness and liveness of the floating point control. The control
was constructed and verified with the help of a model checker [10].

At first sight, floating point operations have two operands and one result. However, 
rounding mode (stored in a special purpose register RM) and interrupt masks (stored in 
SR) are two further operands of every floating point operation. 

Moreover, there is aliasing in connection with the addressing of the floating point
registers: each single precision floating point register can be accessed by single precision
operations as well as by double precision operations. The ISA does not preclude the
construction of a double precision operand by two writes with single precision to the
upper and lower half of a double precision register. It can be necessary to forward these
two results from separate places when the double precision operand is read. This is
easily realized by treating the upper half and the lower half of double precision operands
as separate operands. Thus, reservation stations for dual precision floating point units
have 6 operands: the two halves of each of the two data operands, the rounding mode,
and the interrupt masks.

IEEE Flags and Synchronization. The exception flags for interrupts 6 to 12 are part of
the result of every floating point operation f. They are accumulated in special purpose
register IEEEf during writeback of f. We have already seen in section 4 that this affects
the width of the reorder buffer. A move operation $I_j$ which reads from register IEEEf is
issued only after the entire reorder buffer is empty. This simple modification of the issue
logic makes it very easy to prove that the flags of all floating point operations preceding $I_j$
are accumulated when IEEEf is read by $I_j$. A move instruction from IEEEf to general
purpose register 0, which is constantly 0, acts as a sync operation for self-modifying
code as explained at the end of the following section.



6 Memory Interface 

Loads and Stores with Variable Operand Width. The formal specification of the
semantics of the memory instructions is based on the definitions in [18, Chap. 3]. Accesses
are characterized by their effective address ea and their width in bytes $d \in \{1, 2, 4, 8\}$.
The access is aligned if $ea \bmod d = 0$. Effective addresses ea define a double word
address $da(ea) = \lfloor ea/8 \rfloor$ and a byte address $ba(ea) = ea \bmod 8$. A simple “alignment
lemma” states that for aligned accesses, the memory operand $mem[ea + d - 1 : ea]$
equals bytes $[ba(ea) + d - 1 : ba(ea)]$ of the double word addressed by $da(ea)$ at the
memory interface.⁶ Details can be found in [18].
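
The alignment lemma is easy to check exhaustively on a small byte-addressed memory. The Python sketch below is only an illustration; the function names and the representation of memory as a `bytes` object are assumptions made for this example.

```python
def da(ea):
    """Double word address of effective address ea."""
    return ea // 8


def ba(ea):
    """Byte address of ea within its double word."""
    return ea % 8


def check_alignment_lemma(mem, ea, d):
    """Check the lemma for one aligned access of width d at address ea.

    mem is a bytes object modeling a little endian byte-addressed memory
    whose length is a multiple of 8.
    """
    assert d in (1, 2, 4, 8) and ea % d == 0, "access must be aligned"
    operand = mem[ea:ea + d]                     # mem[ea + d - 1 : ea]
    dword = mem[8 * da(ea):8 * da(ea) + 8]       # double word at da(ea)
    return operand == dword[ba(ea):ba(ea) + d]


if __name__ == "__main__":
    memory = bytes(range(64))
    print(all(check_alignment_lemma(memory, ea, d)
              for d in (1, 2, 4, 8) for ea in range(0, 64, d)))
```

The lemma holds because an aligned access of width d at most 8 never crosses a double word boundary; a misaligned access such as ea = 6, d = 4 would.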

Circuits called shift4load and shift4store are used in order to ensure that data is loaded
and stored correctly. These circuits are shown in figure 2. “Shift for store” denotes shifting
the data, say the halfword which is to be stored, into the correct position of a double
word before it is sent to the 64-bit wide memory interface. Similarly, “shift for load”
denotes extraction of the requested portion (say halfword) of the 64 bits delivered from
the memory interface. Also, sign-extension is done during “shift for load” for signed
byte and halfword loads. Shift for store and load are implemented by means of two
simplified shifters with some control logic [18].
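
A software model of the two shifting steps might look as follows. This is a sketch of the intended input/output behavior only, not of the circuits from [18]: the Python functions, their argument order, and the representation of byte enables as a list of booleans are assumptions made for illustration.

```python
def shift4store(data, ea, d):
    """Place the d-byte value `data` at byte offset ea mod 8 of a double word.

    Returns (din, byte_enables): the 64-bit value sent to the memory
    interface and the list of the 8 byte enables (illustrative model).
    """
    off = ea % 8
    din = (data & ((1 << (8 * d)) - 1)) << (8 * off)
    enables = [off <= b < off + d for b in range(8)]
    return din, enables


def shift4load(dout, ea, d, signed=False):
    """Extract the d-byte portion at byte offset ea mod 8 from a double word."""
    off = ea % 8
    val = (dout >> (8 * off)) & ((1 << (8 * d)) - 1)
    if signed and val >> (8 * d - 1):
        val -= 1 << (8 * d)            # sign extension, as a Python integer
    return val


if __name__ == "__main__":
    din, enables = shift4store(0xBEEF, ea=6, d=2)
    print(hex(din), enables, hex(shift4load(din, ea=6, d=2)))
```

For an aligned access, loading what was just stored returns the original value. In hardware the sign-extended result would be a fixed-width bit string rather than a negative integer.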

The proof of correctness of the VAMP memory interface is structured hierarchically.
First, we verify the VAMP with an idealized memory interface m_spec, a dual-ported
memory without caches. Second, we show that a cache memory interface with split
caches backed up by a unified main memory m_impl behaves exactly like the dual-ported
memory m_spec. Thus, m_spec serves as the specification for the cache memory
interface. By putting these two independent proofs together, we obtain the correctness
of the VAMP with split caches with respect to the memory $mem_S$ of the specification
machine.



Cache Specification and Implementation. The memory m_spec is defined recursively,
i.e., it is updated on the double word address a iff a write access to address a terminates.
Separate byte-enables $mwb_b$ allow for updating only some of the 8 bytes stored at
address a. Formally, we have for any byte b < 8 and any double word address a:

$$
m\_spec[8 \cdot a + b]^{t+1} :=
\begin{cases}
din[b]^{t} & \text{if } a = adr^{t} \wedge mw^{t} \wedge mwb_b^{t} \wedge \neg dbusy^{t} \\
m\_spec[8 \cdot a + b]^{t} & \text{else}
\end{cases}
$$
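
The recursive definition of the specification memory can be read as a one-cycle update function. The Python sketch below is an illustration under assumptions: memory is modeled as a dictionary from byte addresses to byte values, and the signal names `mw` (write request), `mwb` (byte enables), and `dbusy` (access not yet terminated) follow this editor's reading of the garbled formula rather than a confirmed signal list.

```python
def m_spec_step(m_spec, adr, din, mw, mwb, dbusy):
    """Return the specification memory after one cycle.

    m_spec: dict mapping byte addresses to byte values
    adr:    double word address of the access
    din:    list of 8 data bytes
    mw:     True for a write access (assumed signal name)
    mwb:    list of 8 byte enables
    dbusy:  True while the access has not terminated
    """
    nxt = dict(m_spec)
    if mw and not dbusy:               # a write access terminates
        for b in range(8):
            if mwb[b]:                 # only enabled bytes are updated
                nxt[8 * adr + b] = din[b]
    return nxt                         # all other bytes keep their value


if __name__ == "__main__":
    mem0 = {i: 0 for i in range(16)}
    mem1 = m_spec_step(mem0, 1, [1, 2, 3, 4, 5, 6, 7, 8], True,
                       [False] * 6 + [True, True], False)
    print(mem1[14], mem1[15])
```

A halfword store at byte offset 6 of double word 1 thus changes exactly the bytes at addresses 14 and 15.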



The memory interface is implemented with split caches connected to a single main
memory as depicted in figure 5. We use a write-back policy for the data cache, i.e., on a
write access of the CPU, the data cache is updated and the corresponding data is marked
as dirty. Thus, a slow access to the main memory is avoided. If dirty data is to be evicted
from the cache, it is written back to the main memory in order to ensure data consistency.

⁶ Note that this specifies little endian memory organization.

Instantiating Uninterpreted Functional Units and Memory System

The protocol used to keep the caches coherent works as follows: If a cache signals a
hit on a CPU access, the data is read directly from the cache or written to it, depending on
the type of the CPU access. This allows for memory accesses that take only one cycle to
complete. If, on the other hand, the cache signals a miss, the corresponding data has to be
loaded into the cache. The control first examines the other cache in order to find out if it
holds the required data. In this case, the data in the other cache is invalidated. If the data to
be invalidated is dirty, this requires an additional write back to the main memory.

This consistency protocol guarantees exclusiveness, i.e., for any address, at most one
of the two caches signals a hit. In this way, we ensure that on a hit of the instruction cache,
the data cache does not contain newer data.
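
The protocol can be sketched as a small behavioral model in Python. This is a deliberately simplified illustration, not the verified design: the caches have unbounded capacity (so there is no replacement policy and no eviction), lines hold a single data item instead of a sector, and the class and method names are hypothetical.

```python
class SplitCaches:
    """Behavioral sketch of split caches kept coherent by invalidation."""

    def __init__(self, main):
        self.main = dict(main)               # main memory: address -> data
        self.lines = {"I": {}, "D": {}}      # cache: address -> [data, dirty]

    def _reload(self, which, adr):
        other = "D" if which == "I" else "I"
        line = self.lines[other].pop(adr, None)  # invalidate the other copy
        if line is not None and line[1]:
            self.main[adr] = line[0]         # dirty data: write back first
        self.lines[which][adr] = [self.main[adr], False]

    def fetch(self, adr):
        if adr not in self.lines["I"]:       # miss in the instruction cache
            self._reload("I", adr)
        return self.lines["I"][adr][0]

    def load(self, adr):
        if adr not in self.lines["D"]:       # miss in the data cache
            self._reload("D", adr)
        return self.lines["D"][adr][0]

    def store(self, adr, data):
        if adr not in self.lines["D"]:
            self._reload("D", adr)
        self.lines["D"][adr] = [data, True]  # write-back policy: mark dirty

    def exclusive(self):
        """At most one of the two caches holds any given address."""
        return not (self.lines["I"].keys() & self.lines["D"].keys())


if __name__ == "__main__":
    c = SplitCaches({0: 10, 1: 11})
    c.fetch(0)
    c.store(0, 99)
    print(c.fetch(0), c.main[0], c.exclusive())
```

In this model a fetch after a store to the same address misses in the instruction cache, forces the write back of the dirty line, and therefore returns the new value, while the exclusiveness property holds after every access.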

The instruction and data caches are implemented as k-way sectored set-associative
caches using an LRU replacement policy. Cache sectors consist of 4 double words since
the bus protocol supports bursts of length 4.




Fig. 5. Cache memory interface 



Typical Lemmas. The inductive invariant used to show consistency of split caches as
described above consists of three parts. Two of these parts are obvious: if the data or
instruction cache, respectively, signals a hit, then its output data equals the specified
memory content. However, an invariant consisting only of these two claims is not
inductive since caches are reloaded from the main memory. Therefore, we need a third
part of our invariant stating the consistency of data in the main memory. Thus, we also
claim that on a clean hit or a miss in cycle t on address $Dadr^t$ in the data cache, the
main memory m_impl on this address $Dadr^t$ contains the specified memory content.
Note that on a clean hit in the data cache, we thus claim data consistency in both the
data cache and the main memory. Formally, we have the following claim:

$$
\begin{aligned}
& Ihit^{t} \Rightarrow Idout[b]^{t} = m\_spec[8 \cdot Iadr^{t} + b]^{t} \;\wedge \\
& Dhit^{t} \Rightarrow Ddout[b]^{t} = m\_spec[8 \cdot Dadr^{t} + b]^{t} \;\wedge \\
& \neg(Dhit^{t} \wedge dirty^{t}) \Rightarrow m\_impl[8 \cdot Dadr^{t} + b]^{t} = m\_spec[8 \cdot Dadr^{t} + b]^{t}.
\end{aligned}
$$

This invariant is strong enough to show transparency of the whole memory interface 
since the data word returned to the CPU on a read access is just the cache output in case 
of a hit, or the data written to the cache during reload in case of a miss. Note that the 
invariant relies on the exclusiveness property of the protocol, which has to be verified 
as part of the proof of the invariant. 
Fig. 6. 4-burst write timing diagram

Fig. 7. Burst control FSD

Bus Protocol. The main memory is accessed via a bus protocol featuring bursts. The bus
protocol signals ready data by raising brdy one cycle in advance. A sample timing of a
4-burst write is depicted in figure 6. Note that the data input din one cycle after brdy is
written to the main memory and that the end of the access is signaled by $\neg reqp \wedge brdy$.

As part of our correctness proof for the memory interface, we have formalized this bus
protocol and proved by means of theorem proving that an automaton⁷ according to
figure 7 implements this protocol correctly. The main invariant for this proof is the following:
in the cycle of the i-th memory access of the burst, i.e., after the i-th brdy, the automaton
is in state mem for the i-th time. In the cycle of the last memory access, the automaton
is in state last_mem.
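
The timing rule for burst writes can be made concrete with a short Python sketch. It rests on assumptions that go beyond the text: signals are given as per-cycle lists, the final brdy (the one accompanied by a deasserted reqp) is taken to announce the last data item of the burst, and the function name `burst_write` is hypothetical. It models only which data items reach the main memory, not the automaton of figure 7.

```python
def burst_write(reqp, brdy, din):
    """Return the data items written to main memory during one burst.

    reqp, brdy, din are per-cycle lists of equal length. Data present on
    din one cycle after an active brdy is written; the access ends in the
    cycle in which brdy is active while reqp is inactive (assumed reading).
    """
    written = []
    for t in range(len(brdy) - 1):
        if brdy[t]:
            written.append(din[t + 1])   # data one cycle after brdy
            if not reqp[t]:
                break                    # (not reqp) and brdy: end of access
    return written


if __name__ == "__main__":
    reqp = [1, 1, 1, 1, 0, 0, 0]
    brdy = [0, 1, 1, 1, 1, 0, 0]
    din = [None, None, "a", "b", "c", "d", None]
    print(burst_write(reqp, brdy, din))
```

With four active brdy cycles, the last of which coincides with the deasserted reqp, exactly four data items are written, matching a 4-burst.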

Self-Modifying Code. We consider self-modifying code independently of the
implementation of the memory interface. As an additional precondition for the correctness of code,
we demand that in case an instruction is fetched from a memory location adr, there is
a special sync instruction between the last write to adr and the fetch of adr.⁸ In the
VAMP architecture, this sync instruction is implemented without additional hardware
by a special move from the IEEEf register to R0 as mentioned in section 5. We have
formally verified that this use of the sync instruction suffices to show the correctness of
the implementation in case of self-modifying code.



7 Synthesis 

We have translated the PVS hardware description of the VAMP processor to Verilog
HDL using an automated tool called pvs2hdl. The tool unrolls recursive definitions
and then performs a fairly straightforward translation. The Verilog representation of the
processor (including caches and floating point unit) has been synthesized, implemented,
and tested on a Xilinx FPGA hosted on a PCI board. Some additional unverified hardware
for controlling the VAMP processor and for accessing its memory from the host PC is
also present on this FPGA. The VAMP processor occupies about 18000 slices of a Xilinx
Virtex FPGA. This accounts for a gate count of 1.5 million gates as reported by the Xilinx
tools. The design contains 9100 bits of registers (not counting memory and caches) and
runs at 10 MHz.

⁷ Note that this bus control FSD is only a part of the FSD for the cache memory interface.

⁸ This implies the correspondency condition from [23].

Note that we assume a fully synchronous design, i.e., all registers share the same 
clock, and RAM blocks for register files or caches are also updated synchronously with 
this clock; thus, concerning timing, they can be treated like registers. In a fully 
synchronous design, valid data is needed only at the rising edge of the clock, subject to 
certain setup and hold times. The synthesis software analyzes all paths between inputs and 
registers, registers and registers, and registers and outputs; thus, it can guarantee that 
our logical design can be implemented with a certain maximum clock speed while preserving 
all our proved properties. In particular, we fully ignore any glitches, i.e., instabilities 
in signals during a clock period that are resolved before the next rising edge of the clock, 
since these glitches do not influence fully synchronous designs. Consequently, our approach 
does not cover designs where certain signals must be kept stable for several cycles, i.e., 
where glitches must not occur. This is the case for asynchronous EDO-RAM chips, which need 
stable addresses for a fixed amount of time. Since we use synchronous RAM chips, our proofs 
guarantee the correctness of the design regardless of any glitches that occur. 

We have ported gcc and the GNU C library to the VAMP in order to execute test 
programs on it. As was to be expected from our verified design, we found no 
errors in the VAMP processor. When testing some cases of denormal results of floating 
point operations, however, we found differences between the VAMP FPU and Intel’s 
Pentium II FPU. This is due to some discrepancies between Intel’s FPU and the IEEE 
standard. See [11] for further details. 



8 Conclusion 

Verification Effort. The formal verification of the VAMP microprocessor took about 
eight person-years; for the translation tool and synthesis on the FPGA, an additional 
person-year was required. Table 2 summarizes the verification effort for the different 
parts of the VAMP. Note especially that “Putting it all together” took a whole 
person-year, for several reasons. First of all, the proof of the Tomasulo core from [12] 
was only generic and had to be applied to the VAMP architecture, especially the VAMP 
instruction set. Unfortunately, in spite of thorough planning on our part, the interfaces 
between the different parts did not match exactly. Thus, a lot of effort went into patching 
the interfaces. Additionally, self-modifying code and the special implementation of the 
IEEEf register had to be considered. Also, interrupt support and a memory unit still 
had to be added to the formally verified Tomasulo core. Last but not least, PVS does 
not scale well to projects this large; typechecking the VAMP alone already takes 
more than two hours on our fastest machine. 

Table 2. Verification effort 

Part                       Effort in years   Lemmas   Proof steps 
Tomasulo core & ALU               2             521       14367 
FPU                               3            1046       25936 
Cache Memory Interface            2             566       24432 
Putting it all together           1             415       23887 
Total                             8            2548       88622 

To the best of our knowledge, we have reported for the first time the formal 
verification of i) a processor with the full DLX instruction set including load and store 
instructions for bytes, half words, words, and double words, ii) a processor with delayed 
branch, iii) a processor with maskable nested interrupts, iv) a processor with integrated 
floating point unit, v) a memory system with separate instruction and data cache. More 
importantly, the above-mentioned constructions and proofs are integrated into a single 
design and a single correctness proof. Thus, we can be sure that no oversimplifications 
have been made in any part of the design. PVS ensures that there are no proof gaps left. 

The design is synthesized and implemented on an FPGA. The complexity of the 
design is comparable to industrial controllers with FPUs. To the best of our knowledge, 
VAMP is by far the most complex processor formally verified so far. 

We see several directions for further work in the near future: i) adding a store buffer 
to the memory unit; ii) the treatment of a memory management unit with separate 
translation lookaside buffers for data and instructions; iii) proving formally that a 
machine with memory management unit and appropriate page fault handlers as part of the 
operating system gives a single user program the view of a uniform virtual memory, which 
requires arguing about hardware and software simultaneously; iv) redoing as much as 
possible of the present correctness proof with automatic methods. For such methods, any 
subset of our lemmas lends itself as a benchmark suite with a very nice property: we 
know that it can be completed to the correctness proof of a full bit-level design. 
Abstract. The productivity and scalability of verifying pipelined circuits can be 
increased by exploiting the structural and behavioural characteristics that distin- 
guish pipelines from other circuits. This paper presents a formal model of pipelines 
that augments a state machine with information to describe the transfer of par- 
cels between stages, and reading and writing state variables. Using our model, we 
created a definition of correctness that is based on the well-established principles 
of structural, control, and data hazards. We have proved that any pipeline that 
satisfies our hazards-based definition of correctness is guaranteed to satisfy the 
conventional correctness statement of Burch-Dill style flushing. 



1 Introduction 

In early verifications of pipelined circuits, the manual effort to discover abstraction fun- 
ctions limited both the productivity and scalability of verification. Burch and Dill’s use 
of flushing a pipeline to derive an abstraction function automatically [5] improved ve- 
rification productivity and scalability by sheltering the user from the complexities of 
the pipeline. Unfortunately, realistic circuits are beyond the scope of such push-button 
verification. To scale verification to larger pipelines, researchers invented a variety of 
decomposition strategies. Jones et al. used knowledge about pipeline behaviour to create 
incremental flushing [8]. Pnueli et al. [4] and Sawada and Hunt [12] used pipeline behaviour 
as a guide for defining intermediate models. Hosabettu et al. developed completion 
functions to decompose pipelines stage-by-stage [7]. McMillan used knowledge about 
the behaviour of pipelines to guide assume-guarantee decomposition [10]. 

We believe that a model of state machines that captures the distinguishing structure 
and behaviour of pipelined circuits will improve verification productivity and scalability. 
The structure of a pipeline is a network of stages through which parcels (instructions) 
flow. The behaviour of a pipeline can be described using the principles of structural, 
control, and data hazards. This paper presents a formal model and a correctness statement 
for pipelines based on stages, parcels, and hazards. Our goals were: remain true to the 
intuitive meaning of pipelines and hazards, separate orthogonal concerns into distinct 
correctness obligations, and support cutting-edge optimizations. 

Our model of pipelines augments a state machine with pipeline-specific functions 
and predicates (Section 2): transferring a parcel between stages, writing to a variable, 
and reading from a variable. The model supports superscalar and out-of-order execution, 
external kill signals, exceptions, external interrupts, bypass registers, and register rena- 
ming [2]. Our correctness statement, PipeOk, separates correctness obligations relating 
to different hazards, datapath functionality and flushing (Section 3). We have proved 
that any pipeline that satisfies PipeOk is guaranteed to satisfy the standard Burch-Dill 
flushing correctness statement (Section 4). 

PipeOk contains thirteen correctness obligations that provide a natural decomposi- 
tion strategy. Each obligation describes a single type of behaviour, for example, write- 
after-write hazards. Because hazards are well understood by both verification and design 
engineers, verification engineers will be able to more easily discuss test plans, verifica- 
tion strategies, and counter examples with designers. Because each obligation focuses 
on a single type of behaviour, verifying the obligations will be amenable to powerful 
abstraction mechanisms. For example, the ordering of reads and writes can be verified 
separately for each variable and need only reason about consecutive operations. 

To prove that PipeOk implies Burch-Dill correctness, we prove that PipeOk implies 
Flushpoint Equality (flushed states are externally equivalent to specification states) and 
then use the previously proven result that Flushpoint Equality implies Burch-Dill cor- 
rectness [3]. We prove that PipeOk implies Flushpoint Equality by showing: read and 
write operations happen in the correct order, the result of each write operation is correct, 
and finally that flushing works correctly. 

2 Modelling Pipelines 

This section describes our formal model of pipelines. We begin with an informal descrip- 
tion of the “parcel view” of a pipeline, which motivates our approach. The remainder of 
the section presents the model, auxiliary functions to relate a pipeline to its specification, 
and conditions to ensure that the auxiliary functions are consistent. 

2.1 The Parcel View of a Pipeline 

A pipeline is a network of stages. Parcels, or instructions, flow through the stages and 
read-from and write-to variables, or signals, in the pipeline. Figure 1 shows the runs of 
a sample program on an instruction set architecture specification, a four-stage pipelined 
microprocessor, and a “parcel view” of the pipeline. Each run is annotated to show when 
each parcel moves between stages and when each variable is read or written. The value 
of a variable is denoted by the label of the instruction that writes to the variable. 

Conventional verification strategies compare a snapshot of the pipeline state to a 
specification state. Because a pipeline state contains the effects of multiple partially 
executed parcels, it is difficult to relate the implementation to the specification. For 
example, step 4 of the pipeline contains parcels A, B, C, and D, which represent portions 
of steps 1, 2, 3, and 4 of the specification. A recent trend has been to examine the 
implementation only when it is in a flushed state, such as steps 0 and 9 of the pipeline, 
which are externally equivalent to steps 0 and 5 of the specification. 

The parcel view shows slices of the pipeline state as perceived by each parcel. 
Different variables in the same slice come from different points in time. The slice to 
the left (right) of each parcel shows the variables as read (written) by the parcel. Gray 
backgrounds denote values that are inconsistent with the specification. For example, in 
step 2 of the parcel view, R1 is shown in gray, because R1 is I in the pipeline and A in the 
specification. The parcel for B is able to execute correctly, because it reads its operand 
from the bypass register, which corresponds to R1 at that time. 

Fig. 1. Specification, pipeline and parcel view of a sample program 

Table 1. Definition of a pipeline 

Conventional state machine 
  state                                                    Set of states. 
  Nsr      : state → state → bool                          Next-state relation. 
  isInit   : state → bool                                  Initial-state predicate. 
Pipeline sets 
  stage                                                    Set of identifiers for stages in the pipeline, including Top and Bot. 
  addr_i                                                   Set of identifiers for data storage variables in the pipeline. 
  isExt    : (a : addr_i) → (q : state) → bool             Variable is externally visible. 
  isStore  : (a : addr_i) → (q : state) → bool             Variable is for data storage. 
  subPipes : (s : stage) → pipe                            One pipe record for each stage. 
Probes 
  xfr      : (q : state) → (s1 : stage) → (s2 : stage) → bool 
                                                           In state q, a parcel transfers from s1 to s2. 
  Wr       : (a : addr_i) → (q : state) → (s : stage) → bool 
                                                           A parcel in s writes to address a in state q. 
  Rd       : (a : addr_i) → (q : state) → (s : stage) → bool 
                                                           A parcel in s reads from address a in state q. 

The parcel view of pipelines was inspired by two observations: first, for each parcel, 
the only state variables that are relevant to its correctness are those that it reads or writes; 
second, if every parcel is executed correctly, then the pipeline is correct. Our proof 
that our correctness statement, PipeOk, implies Burch-Dill flushing relies on the parcel 
view of the pipeline. We have proved that if the order of read and write operations with 
respect to parcels in the pipeline is the same as the order with respect to states in the 
specification, then data dependencies are obeyed. 

2.2 Formal Model of Pipelines 

Our formal model of pipelines (Table 1) augments a standard model of non-deterministic 
state-machines with predicates to detect when parcels transfer between stages, read from 
state variables, and write to state variables. We use these predicates to compute the parcel 
view of a pipeline from the next-state relation. 

The predicate xfr detects the transfer of a parcel between two stages. We have defined 
instantiations of xfr for a wide variety of protocols for transferring parcels [1]. Transfers 
can often be detected using one or two signals, such as the valid bits for the stages. In 
the set of stages, Top and Bot are virtual stages: they do not exist in the pipeline. For 
input/output pipelines, such as systolic arrays or execution units in microprocessors, Top 
represents the module in the environment from which parcels enter the pipeline and Bot 
represents the module to which parcels exit. For closed systems, such as microprocessors 
with built-in memory, transferring from/to Top and Bot is defined in terms of operations in 
the pipeline, such as fetching an instruction. Pipelines may contain atomic stages, which 
hold at most one parcel, and hierarchical stages, which may themselves be pipelines. 
We support this with the subPipes field. 

Table 2. Functions for comparing a pipeline and specification 

Sets 
  addr_s                                                   Set of identifiers for data storage variables in the specification. 
  data_s                                                   Set of data values in the specification. 
Structural-hazard correctness 
  Match      : (σ : run) → (t1 : time) → (t2 : time) → bool 
                                                           The parcel that enters at time t1 exits at time t2. 
Control-hazard correctness 
  ShouldExit : (σ : run) → (t : time) → bool               The parcel that enters should eventually exit. 
Data-hazard and datapath correctness 
  addrmap    : (a : addr_i) → (q : state) → addr_s         Maps addresses of implementation to addresses in the specification. 
  datamap    : (a : addr_i) → (q : state) → data_s         Maps the data in q.a to the corresponding specification data value. 
Flushing correctness 
  Flush      : state → state                               Flushes a state. 
  IsFlushed  : state → bool                                A state is flushed. 

State machines commonly distinguish internal and external variables (isExt for “is 
external”). We refine this by dividing variables into data-storage and pipeline variables 
(isStore for “is storage”). Data-storage variables are used to represent variables in the 
specification, and can be either internal (e.g., bypass registers) or external (e.g., register 
files). Pipeline variables are the registers that hold parcels in stages. They are internal 
and have no corresponding variables in the specification. Read and write predicates need 
only monitor storage variables. 
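
To make the shape of the model concrete, the record of Table 1 can be sketched as a Python data class. This is only an illustration under stated assumptions: the toy two-stage pipeline (stages `F` and `E`, one register `R1`, a state made of two valid bits) is hypothetical and not taken from the paper, the subPipes field is omitted, and entry from Top is not modelled.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Tuple

State = Tuple[bool, bool]  # toy state: (valid bit of stage F, valid bit of stage E)
Stage = str
Addr = str


@dataclass(frozen=True)
class Pipe:
    """Record mirroring the fields of Table 1 (subPipes omitted for brevity)."""
    nsr: Callable[[State, State], bool]         # next-state relation
    is_init: Callable[[State], bool]            # initial-state predicate
    stages: FrozenSet[Stage]                    # includes the virtual stages Top and Bot
    addrs: FrozenSet[Addr]                      # data storage variables
    is_ext: Callable[[Addr, State], bool]       # variable is externally visible
    is_store: Callable[[Addr, State], bool]     # variable is for data storage
    xfr: Callable[[State, Stage, Stage], bool]  # a parcel transfers from s1 to s2
    wr: Callable[[Addr, State, Stage], bool]    # a parcel in s writes to a
    rd: Callable[[Addr, State, Stage], bool]    # a parcel in s reads from a


def toy_nsr(q: State, q2: State) -> bool:
    # E receives whatever F held; the new content of F is unconstrained.
    return q2[1] == q[0]


def toy_xfr(q: State, s1: Stage, s2: Stage) -> bool:
    # Transfers are detected from the valid bits alone.
    if (s1, s2) == ("F", "E"):
        return q[0]
    if (s1, s2) == ("E", "Bot"):
        return q[1]
    return False


toy = Pipe(
    nsr=toy_nsr,
    is_init=lambda q: q == (False, False),
    stages=frozenset({"Top", "F", "E", "Bot"}),
    addrs=frozenset({"R1"}),
    is_ext=lambda a, q: a == "R1",
    is_store=lambda a, q: a == "R1",
    xfr=toy_xfr,
    wr=lambda a, q, s: a == "R1" and s == "E" and q[1],
    rd=lambda a, q, s: a == "R1" and s == "F" and q[0],
)
```

The probes are ordinary predicates over a single state, which is what allows the parcel view to be computed from the next-state relation without changing the state machine itself.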



2.3 Relating Implementations and Specifications 

To verify a pipeline against a specification, we need to compare the behaviours of the 
pipeline and specification. Typically, this is done with a function to say how many 
instructions are fetched and an external-equivalence relation. Table 2 shows the analogous 
objects for our model. 

We use Match to identify the entrance and exit time of each parcel. Match supports 
superscalar pipelines by instantiating the type time with a pair of a clock cycle and a 
port [1]. When working with hierarchical pipelines, we want to treat the stages as black 
boxes. The Match relation allows us to match parcels entering and exiting stages while 
hiding the internal structure of the stage. We have found five common instantiations for 
Match: degenerate, constant latency, in-order, unique tags, and tagged in-order [1]. 
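
Two of the instantiations named above can be sketched as follows. This is a hedged illustration, not the definitions from [1]: the run is assumed to be a finite list of per-cycle records with hypothetical boolean fields `"in"` and `"out"`, time is a plain clock-cycle index (no ports), and the constant-latency version does not check that a parcel actually entered.

```python
def match_constant_latency(latency):
    """Match for a stage in which every parcel exits `latency` cycles after entering."""
    def match(run, t1, t2):
        return t2 == t1 + latency
    return match


def match_in_order(enters, exits):
    """Match for a stage that preserves order: the k-th entry pairs with the k-th exit.

    `enters(run, t)` and `exits(run, t)` are hypothetical predicates telling
    whether a parcel enters or exits the stage at time t."""
    def match(run, t1, t2):
        if t2 < t1 or not (enters(run, t1) and exits(run, t2)):
            return False
        entries_before = sum(1 for t in range(t1) if enters(run, t))
        exits_before = sum(1 for t in range(t2) if exits(run, t))
        return entries_before == exits_before
    return match


# Toy run: parcels enter at cycles 0 and 1 and exit at cycles 2 and 4.
run = [
    {"in": True, "out": False},
    {"in": True, "out": False},
    {"in": False, "out": True},
    {"in": False, "out": False},
    {"in": False, "out": True},
]
in_order = match_in_order(lambda r, t: r[t]["in"], lambda r, t: r[t]["out"])
fixed = match_constant_latency(2)
```

Here `in_order(run, 0, 2)` and `in_order(run, 1, 4)` hold, while `in_order(run, 0, 4)` does not. Because both functions have the same signature, a hierarchical stage can be treated as a black box that merely supplies its own Match.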
Table 3. Consistency conditions on pipelines and specifications 

Specification conditions 
   1  The specification is deterministic. This is required for flushpoint-equality correctness 
      to imply Burch-Dill correctness. Implementations may be non-deterministic. 
Traversal conditions 
   2  If ShouldExit is true, then a parcel entered the pipeline. 
   3  Parcels cannot transfer from the pipeline to the “Top” stage. 
   4  Parcels cannot transfer from the “Bot” stage to the pipeline. 
   5  Time increases monotonically as parcels traverse through the pipeline. 
   6  IsFlushed cannot be true while a parcel is traversing through the pipeline. 
   7  A storage operation can happen in a stage only if a parcel is in the stage. 
Storage conditions 
   8  If an address map changed, then a write must have happened in Impl. 
   9  If a data map changed, then a write must have happened in Impl. 
  10  If a Spec variable changed, then a write must have happened in Spec. 
  11  When a pipeline is flushed, external equality and storage equality are identical. 
Flushing conditions 
  12  Flush is idempotent on flushed pipelines. 
  13  All reachable states are reachable from a flushed state. 
  14  From any state, a flushed state can be reached eventually. 



The predicate ShouldExit says whether a parcel that enters the pipeline should be 
executed. We have identified instantiations for ShouldExit that include external kill 
signals, branch prediction, internal exceptions, and external interrupts [2]. 

We separate external equivalence into two functions: addrmap, which defines a map- 
ping between variables in the pipeline and specification, and datamap, which maps data 
in the pipeline to the specification. Address maps may be dependent on the current state: 
the identity of the specification variable that a bypass register represents is dependent 
upon the contents of the bypass register. When an implementation variable does not 
represent any specification variable (e.g., a bypass register when it contains a bubble), 
addrmap returns ⊥, as shown in steps 0-2 for the pipeline in Figure 1. 
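
A state-dependent address map can be sketched as follows. This is only an illustration under assumptions: the state layout (a dictionary with hypothetical keys `regfile`, `bypass_dest` and `bypass_val`) is invented for the example, and Python's `None` stands in for ⊥.

```python
from typing import Optional


def addrmap(a: str, q: dict) -> Optional[str]:
    """Map an implementation variable to the specification variable it represents."""
    if a in q["regfile"]:
        return a                 # register-file entries map to themselves
    if a == "bypass":
        return q["bypass_dest"]  # depends on the contents; None for a bubble
    return None                  # not a data-storage variable


def datamap(a: str, q: dict) -> Optional[int]:
    """Map the data held in an implementation variable to a specification value."""
    if a in q["regfile"]:
        return q["regfile"][a]
    if a == "bypass" and q["bypass_dest"] is not None:
        return q["bypass_val"]
    return None


# The bypass register holds a pending result for R1.
q_busy = {"regfile": {"R1": 5, "R2": 7}, "bypass_dest": "R1", "bypass_val": 9}
# The bypass register holds a bubble.
q_bubble = {"regfile": {"R1": 5, "R2": 7}, "bypass_dest": None, "bypass_val": 0}
```

In `q_busy` both `R1` and `bypass` map to the specification variable `R1`, which is the situation in which a parcel may read its operand from the bypass register. In `q_bubble` the bypass register represents no specification variable at all.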

To relate PipeOk to flushpoint equality and Burch-Dill flushing, we require that each 
pipeline defines a function Flush and a predicate IsFlushed. 



2.4 Consistency Conditions 

Table 3 summarizes the conditions required for the predicates and functions in 
the pipeline model to be consistent with the behaviour of the state machine in the model. 
The complete mathematical definitions appear in a technical report [2]. 



3 Correctness Obligations 

We begin with a summary of our notation. We then present our correctness obligations 
according to the different types of hazards, datapath functionality, and flushing. 

3.1 Notation 

When working with theorems relating a run of a specification to a run of an 
implementation, we often find it useful to draw “box” or commuting diagrams (Figure 2a). In 
Figure 2a, x and y refer to the states shown as circles. Properties associated with states 
and edges are listed in Figure 2b. We denote the t-th element of a run $\sigma$ as $\sigma^t$. 
We use run m $\sigma$ to mean that $\sigma$ is a run of the state machine m, as defined by 
$\forall t.\ m\ \sigma^{t}\ \sigma^{t+1}$. 

As a syntactic shorthand, we write m q q' rather than m.Nsr q q', and we drop the name 
of the pipeline when referring to parameters other than Nsr. 
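
The definition of a run can be checked mechanically on a finite prefix. The sketch below is an assumption-laden illustration: a run is represented as a finite Python list, whereas the definition in the text quantifies over all time steps, and the modulo-4 counter is a hypothetical example machine.

```python
def is_run_prefix(nsr, sigma):
    """True if every pair of consecutive states in sigma satisfies the next-state relation."""
    return all(nsr(sigma[t], sigma[t + 1]) for t in range(len(sigma) - 1))


# Hypothetical state machine: a counter that wraps around after 3.
def counter_nsr(q, q_next):
    return q_next == (q + 1) % 4


good = [0, 1, 2, 3, 0, 1]
bad = [0, 1, 3]
```

`is_run_prefix(counter_nsr, good)` holds and `is_run_prefix(counter_nsr, bad)` does not. Because the next-state relation is a predicate rather than a function, the same check applies unchanged to non-deterministic machines.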



Fig. 2a. Graphical notation (the diagrams are not recoverable here; the formulas they denote are listed) 

    $P\ x$ 
    $(P\ x) \land (Q\ y)$ 
    $(P\ x) \land (Q\ y) \land (f\ x\ y)$ 
    $(P\ x) \land (Q\ y) \land (x < y)$ 
    $(P\ x) \land (Q\ y) \Rightarrow (f\ x\ y)$ 
    $(P\ x) \land (Q\ y) \Rightarrow (x < y)$ 
    $(P\ x) \Rightarrow \exists y.\ Q\ y$ 
    $(P\ x) \Rightarrow \exists y.\ (Q\ y) \land (f\ x\ y)$ 
    ILLEGAL: $(P\ x) \land (f\ x\ y) \Rightarrow (\exists y.\ Q\ x)$ 

Fig. 2b. State and step properties 

    0               The initial state 
    R               A read is performed 
    W               A write is performed 
    F               The state is flushed 
    $\overline{W}$  No write is performed 

Fig. 2c. Variable identifiers 

    a          Address 
    q          State 
    s          Stage 
    t          Time 
    $\sigma$   Run of a state machine 

Fig. 2. Notation and conventions 



3.2 Top-Level Correctness Statements 

Our top-level correctness statement, Definition 1 (PipeOk), is the conjunction of thirteen 
correctness obligations. Each correctness obligation guarantees that a particular 
type of behaviour is implemented correctly. Section 3.3 describes structural-hazard cor- 
rectness; Section 3.4 describes data-hazard correctness; Section 3.5 describes datapath 
functionality correctness; Section 3.6 describes additional correctness obligations nee- 
ded to ensure that flushed states are externally equivalent to specification states. There 
are no correctness obligations that address only control hazards. Instead, control hazards 
permeate both structural hazard correctness and data hazard correctness. For structural 
hazards, we make sure that correctly speculated parcels are executed and incorrectly spe- 
culated parcels do not exit the pipeline. For data hazards, we make sure that incorrectly 
speculated parcels do not leave behind data results that are read by correctly speculated 
parcels. 
Definition 1 Correctness of pipelines 
PipeOk Impl Spec = 

      Structural-hazard correctness 
           1  EnterTotFun Impl 
        ∧  2  ExitTotFun Impl 
        ∧  3  MatchIffTrav Impl 
    ∧ Data-hazard correctness 
           4  RawHazOk Impl Spec 
        ∧  5  WawHazOk Impl Spec 
        ∧  6  WarHazOk Impl Spec 
        ∧  7  SpecRdTotFun Impl Spec 
        ∧  8  SpecWrTotFun Impl Spec 
        ∧  9  ImplWrTotFun Impl Spec 
    ∧ Datapath correctness 
          10  DatapathOk Impl Spec 
    ∧ Flushing correctness 
          11  ImplWrFlush Impl Spec 
        ∧ 12  SpecWrFlush Impl Spec 
        ∧ 13  ImplInvalidateFlush Impl Spec 



3.3 Structural-Hazard Correctness Obligations 

Structural hazard correctness is concerned with contention between parcels for resour- 
ces in the pipeline. Typical bugs associated with structural hazards are loss of parcels, 
duplication of parcels, generation of bogus parcels inside the pipeline, deadlock, and 
livelock. A pipeline handles its structural hazards correctly if there is a one-to-one map- 
ping between parcels that enter the pipeline and should exit and those parcels that do 
exit, and if the parcels that exit do so in the correct order. 

Definition 2 tracks a parcel as it traverses from stage to stage in a pipeline. The 
expression $(t_1, s_1) \overset{\sigma}{\leadsto} (t_n, s_n)$ means that in the run $\sigma$, 
a parcel enters the stage $s_1$ at $t_1$, traverses from $s_1$ to $s_n$, and exits the stage 
$s_n$ at $t_n$. In the base case $s_1$ and $s_n$ are the same stage. In the inductive case, 
there is an intermediate stage $s_2$ such that the parcel transfers from $s_1$ to $s_2$ and 
then traverses from $s_2$ to $s_n$. To detect when the parcel exits $s_1$, we use the 
matching relation provided by $s_1$, according to our hierarchical model of pipelines. 
Definition 2 supports pipelines with loops, because Match separately identifies each 
iteration. We use $\leadsto$ to define Trav, which means a parcel traverses through the 
pipeline from Top to Bot. 

Definition 2 Traversing between stages in a pipeline ($\leadsto$) 

\[
(t_1, s_1) \overset{\sigma}{\leadsto} (t_n, s_n) \;=\;
\begin{array}[t]{l}
\bigl[\; s_1 = s_n \;\land\; s_1.\mathit{Match}\ \sigma\ t_1\ t_n \;\bigr] \\[2pt]
\lor\; \bigl[\; \exists\, t_2, s_2.\;\;
      s_1.\mathit{Match}\ \sigma\ t_1\ t_2
      \;\land\; \mathit{xfr}\ \sigma^{t_2}\ s_1\ s_2
      \;\land\; (t_2, s_2) \overset{\sigma}{\leadsto} (t_n, s_n) \;\bigr]
\end{array}
\]

Obligation 1, EnterTotFun, says that for each time ($t_1$) that a parcel enters the pipeline 
and should exit, there exists exactly one time ($t_2$) such that the parcel exits at $t_2$ 
(total and functional). Obligation 2, ExitTotFun, says that each parcel that exits the 
pipeline (XfrOut) comes from exactly one parcel that entered the pipeline and should have 
exited (surjective and injective). Together, Obligations 1 and 2 guarantee that the 
relationship between entering and exiting parcels is bijective. 
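Obligation 1 Each entrance results in exactly one exit 

\[
\mathit{EnterTotFun}\ \mathit{Impl} \;=\;
\forall\, \sigma_i, t_1.\;
\bigl(\, \mathit{run}\ \mathit{Impl}\ \sigma_i
   \;\land\; \mathit{IsFlushed}\ \sigma_i^{0}
   \;\land\; \mathit{ShouldExit}\ \sigma_i\ t_1 \,\bigr)
\;\Rightarrow\;
\exists!\, t_2.\ \mathit{Trav}\ \mathit{Impl}\ \sigma_i\ t_1\ t_2
\]

Obligation 2 Each exit comes from exactly one entrance 

\[
\mathit{ExitTotFun}\ \mathit{Impl} \;=\;
\forall\, \sigma_i, t_2.\;
\bigl(\, \mathit{run}\ \mathit{Impl}\ \sigma_i
   \;\land\; \mathit{IsFlushed}\ \sigma_i^{0}
   \;\land\; \mathit{XfrOut}\ \sigma_i\ t_2 \,\bigr)
\;\Rightarrow\;
\exists!\, t_1.\ \bigl(\, \mathit{Trav}\ \mathit{Impl}\ \sigma_i\ t_1\ t_2
   \;\land\; \mathit{ShouldExit}\ \sigma_i\ t_1 \,\bigr)
\]

Obligation 3, MatchIffTrav, says that parcels that exit the pipeline do so in the 
correct order, as defined by the pipeline-specific matching relation (Match). MatchIffTrav 
allows pipelines to be treated as black boxes in hierarchical verification, by relating 
the traversal of parcels inside the pipeline, Trav, to the entrance and exit of parcels. 

Obligation 3 Match correctly identifies when a parcel traverses the pipeline 

\[
\mathit{MatchIffTrav}\ \mathit{Impl} \;=\;
\forall\, \sigma, t_1, t_2.\;
\bigl[\, \mathit{Match}\ \mathit{Impl}\ \sigma\ t_1\ t_2 \,\bigr]
\;\Longleftrightarrow\;
\bigl[\, \mathit{Trav}\ \mathit{Impl}\ \sigma\ t_1\ t_2 \,\bigr]
\]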
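
On a finite trace, Obligations 1 and 2 amount to checking that the traversal relation is a bijection between entrances that should exit and actual exits. The sketch below is a hypothetical test harness, not the paper's formalization: the traversal relation is given explicitly as a set of (entry time, exit time) pairs, and the quantifiers range over a bounded horizon.

```python
def exactly_one(items):
    """True if the iterable yields exactly one element."""
    return sum(1 for _ in items) == 1


def make_trav(pairs):
    """Build a traversal predicate from explicit (entry time, exit time) pairs."""
    pairs = set(pairs)
    return lambda t1, t2: (t1, t2) in pairs


def enter_tot_fun(trav, should_exit_times, horizon):
    """Obligation 1 on a finite trace: each entrance that should exit has exactly one exit."""
    return all(
        exactly_one(t2 for t2 in range(horizon) if trav(t1, t2))
        for t1 in should_exit_times
    )


def exit_tot_fun(trav, should_exit_times, exit_times):
    """Obligation 2 on a finite trace: each exit comes from exactly one such entrance."""
    return all(
        exactly_one(t1 for t1 in should_exit_times if trav(t1, t2))
        for t2 in exit_times
    )


ok = make_trav({(0, 3), (1, 4)})          # two parcels, both traverse
lost = make_trav({(0, 3)})                # the parcel entering at 1 is lost
duplicated = make_trav({(0, 3), (0, 4)})  # the parcel entering at 0 exits twice
```

With entrances at times 0 and 1, `ok` satisfies both checks, `lost` fails `enter_tot_fun`, and `duplicated` fails it as well. An exit at a time with no matching entrance, such as time 5 under `lost`, models a bogus parcel and fails `exit_tot_fun`.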



3.4 Data-Hazard Correctness Obligations 

A data dependency exists between a producing (writing) instruction and a consuming 
(reading) instruction if the producing instruction writes to an address that the 
consuming instruction reads from and no instruction between the producer and the consumer 
writes to that address. A pipeline implements data dependencies correctly if every data 
dependency in the specification is obeyed in the implementation. 
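
The definition of a data dependency can be turned into a short computation over a specification trace. This is a minimal sketch under assumptions: the trace format (a list of `("R", addr)` and `("W", addr)` operations in program order) is hypothetical, and each list index stands for one operation rather than one instruction.

```python
def dependencies(ops):
    """Map the index of each read to the index of its producing write.

    The producer is the most recent earlier write to the same address;
    a read with no earlier write maps to None."""
    last_write = {}
    deps = {}
    for i, (kind, addr) in enumerate(ops):
        if kind == "R":
            deps[i] = last_write.get(addr)
        elif kind == "W":
            last_write[addr] = i
    return deps


ops = [("W", "R1"), ("R", "R1"), ("W", "R1"), ("R", "R1"), ("R", "R2")]
deps = dependencies(ops)
```

For this trace, the read at index 1 depends on the write at index 0, the read at index 3 depends on the write at index 2, and the read of `R2` at index 4 has no producer in the trace.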

Data hazards are categorized as read-after-write, write-after-read, and write-after-write. 
If a pipeline handles all three types of data hazards correctly, then it implements 
data dependencies correctly. In Figure 3, the gray lines represent orderings between 
specification and implementation operations that will violate the dependency between 
$W_i$ and $R_i$. Read-after-write (Raw) hazard correctness guarantees that $R_i$ occurs after 
$W_i$. Together, write-after-write and write-after-read hazard correctness guarantee that no 
write will occur to this address between $W_i$ and $R_i$. Write-after-write (Waw) correctness 
guarantees that no programmatically earlier write happens after $W_i$. Write-after-read 
(War) correctness guarantees that no programmatically later write will occur before $R_i$. 
Figure 3 makes many simplifications that are violated by optimizations such as bypass 
registers, register renaming, and out-of-order execution. Our formalization supports these 
optimizations using dynamic address maps, multiple writes, and out-of-order writes [2]. 

The data hazard obligations ensure that reads and writes in the implementation occur 
in the correct order. We use the symbols $\mathrm{Wr}{\lessdot}\mathrm{Rd}$, 
$\mathrm{Rd}{\lessdot}\mathrm{Wr}$, and $\mathrm{Wr}{\lessdot}\mathrm{Wr}$ to denote consecutive 
write and read operations in a run. Definition 3 describes a read following a write to the 
address a in the run $\sigma$. To the right of the text is an illustration of the definition 
using the graphical notation presented in Figure 2a. The definitions for a write following 
a read and a write following a write are similar. 
Fig. 3. Data-dependencies and the three types of data hazards 



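Definition 3 Consecutive read-after-write ordering 

\[
(t_w, s_w)\ \overset{\sigma,\,a}{\mathrm{Wr}{\lessdot}\mathrm{Rd}}\ (t_r, s_r) \;=\;
\begin{array}[t]{l}
\mathit{Wr}\ a\ \sigma^{t_w}\ s_w \\
\land\; \mathit{Rd}\ a\ \sigma^{t_r}\ s_r \\
\land\; t_w < t_r \\
\land\; \forall\, t \in \{t_w + 1 \,..\, t_r - 1\}.\ \forall\, s.\ \lnot(\mathit{Wr}\ a\ \sigma^{t}\ s)
\end{array}
\]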
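
Definition 3 can be read as a predicate over the write and read events of a run. The following is a sketch under assumptions: events are given as explicit sets of hypothetical `(address, time, stage)` triples instead of being derived from the Wr and Rd probes, and the write is assumed to occur strictly before the read.

```python
def wr_then_rd(writes, reads, a, t_w, s_w, t_r, s_r):
    """Consecutive read-after-write ordering on address a.

    Holds if stage s_w writes a at t_w, stage s_r reads a at t_r, the write
    precedes the read, and no stage writes a strictly between the two."""
    return (
        (a, t_w, s_w) in writes
        and (a, t_r, s_r) in reads
        and t_w < t_r
        and not any(addr == a and t_w < t < t_r for (addr, t, _stage) in writes)
    )


# Toy event sets: R1 is written at times 1 and 4 and read at times 3 and 6.
writes = {("R1", 1, "WB"), ("R1", 4, "WB")}
reads = {("R1", 3, "ID"), ("R1", 6, "ID")}
```

The write at time 1 and the read at time 3 are consecutive, as are the write at time 4 and the read at time 6. The write at time 1 and the read at time 6 are not, because the write at time 4 intervenes. The check is local to one address, which matches the observation that the ordering of reads and writes can be verified separately for each variable.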

Obligation 4, RawHazOk, says that if there is a data dependency in the specification 
and a corresponding read in the implementation $(a_i, t_{ri}, s_r)$, then there 
must exist a corresponding write $(a_i, t_{wi}, s_w)$ that happens before the read. 



Obligation 4 Correctness of read-after-write data hazards 
RawHazOk Impl Spec = [formula and diagram not recoverable from the source] 

RawHazOk contains the first appearance of the relation 
$\mathit{Spec} \underset{\mathrm{RUN}}{\cong} \mathit{Impl}$ (“run correspondence”), which 
says that $\sigma_s$ is a run of Spec, $\sigma_i$ is a run of Impl from a flushed 
state, and the initial states of $\sigma_s$ and $\sigma_i$ are externally equivalent. 

We formalize an operation in a run of an implementation corresponding to an operation 
in the specification by tracking a parcel as it traverses the pipeline. The parcel 
that enters the pipeline and should exit corresponds to the step of the specification. 
The expression $t_s \underset{\mathrm{P}}{\cong} (t_{in}, s_n)$ means that at time $t_{in}$, 
the parcel that entered the pipeline and should exit is either inside the stage $s_n$ or is 
just exiting $s_n$. 

Read and write correspondences are defined in terms of parcel correspondence 
($\underset{\mathrm{P}}{\cong}$). The expression 
$(a_s, t_{ws}) \underset{\mathrm{Wr}}{\cong} (a_i, t_{wi}, s)$ means: the specification 
instruction at time $t_{ws}$ writes to address $a_s$, the instruction corresponds to the 
parcel in stage s at time $t_{wi}$, the parcel writes to $a_i$, and the address map of 
$a_i$ at time $t_{wi}$ points to $a_s$. 

Write-after-write and write-after-read hazards are dealt with by Obligations 5 and 
6, both of which have a case for in-order writes and a case for out-of-order writes. 
The in-order cases are simpler, because they deal only with consecutive operations, as 
denoted by $\lessdot$. The out-of-order cases require looking beyond consecutive operations, 
because we do not know how far out-of-order the operations will be. We use 
“$\mathrm{Wr}{<}\mathrm{Wr}$” and “$\mathrm{Rd}{<}\mathrm{Wr}$” for the transitive ordering 
of write and read operations. 

Obligation 5 Correctness of write-after-write data hazards 
WawHazOk Impl Spec = [formula and diagram not recoverable from the source; it consists of 
an in-order case and an out-of-order case] 

Obligation 6 Correctness of write-after-read data hazards 
WarHazOk Impl Spec = [formula and diagram not recoverable from the source; it consists of 
an in-order case and an out-of-order case] 



The out-of-order case for Obligation 5 requires that $t_{wi1}$ does not corrupt data by 
occurring between another implementation write ($t_{wi2}$) and its dependent read ($t_{ri2}$). 
The out-of-order case of Obligation 6, WarHazOk, is simpler than that of Obligation 5, 
WawHazOk, because we do not need to mention the specification write that corresponds 
to $t_{wi1}$. The purpose of the out-of-order case is to allow $t_{wi2}$ to happen before 
$t_{ri1}$ while ensuring that $t_{wi2}$ does not corrupt the data intended for $t_{ri1}$. If 
$t_{wi2}$ corrupts the data, then $t_{wi2}$ will be the producer for $t_{ri1}$, which causes 
the right-hand side of the implication to be $t_{wi2} < t_{wi2}$, which is clearly false. 

Obligations 4-6 guarantee that, if read and write operations occur in the implemen- 
tation, then they will occur in the correct order. These obligations do not guarantee 
that the operations actually do occur in the implementation. Obligations 7-9 ensure 
that reads and writes in the specification will also occur in the implementation and 
that writes that occur in the implementation correspond to writes in the specification. 
For brevity, we omit the mathematical definitions, which can be found elsewhere [2]. 

Obligation 7 SpecRdTotFun Impl Spec = Each read operation in Spec corresponds 
to exactly one read operation in Impl 

We allow multiple writes in the implementation to correspond to a single write 
in the specification, so long as the writes are to different variables (Obligation 8, 
SpecWrTotFun). This feature is required to support simple optimizations, such as by- 
pass registers, as well as complex optimizations, such as retirement register files. 

Obligation 8 SpecWrTotFun Impl Spec = Each write in Spec has at least one 
corresponding write in Impl. If two writes in Impl correspond to the same write 
in Spec, then the Impl writes must be to different addresses in Impl. 

We allow implementations to perform writes that do not correspond to writes in
the specification, so long as these writes are not read (Obligation 9, ImplWrTotFun).
This freedom provides a uniform mechanism for implementations to invalidate data
(remapping a register in register renaming) as well as to modify the contents of variables
that are not needed (bubbles changing the value of a bypass register as they propagate
through it). A variable is invalid if its address map is changed so that it no longer points
to an address in the specification. As shown in Figure 1, when a bypass register contains
a bubble, we say that its address map returns ⊥. Obligations 11-13 in Section 3.6 ensure
that these writes do not corrupt data before a flushed state.

Obligation 9 ImplWrTotFun Impl Spec = Each write in Impl that is the last write 
before a read from the same address must have a corresponding write in Spec. 



3.5 Datapath Correctness Obligation 



Definition 4 describes when two storage variables are equivalent: their address 
maps point to the same address and their data maps return the same data value. 



Definition 4 Equality of storage variables

\[ (a_1, q_1) = (a_2, q_2) \;\equiv\; [a_1 = \mathit{addrmap}\ a_2\ q_2] \land [q_1.a_1 = \mathit{datamap}\ a_2\ q_2] \]



The datapath of a pipeline is correct if, assuming every read operation that a parcel
performs will consume the correct data, then every write that parcel performs must
produce the correct data (Obligation 10, DatapathOk). The clause dealing with reads is
nested within the antecedent to provide a uniform way of dealing with both parcels that
perform reads and those whose results are independent of the contents of the pipeline
storage variables.



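Obligation 10 Correctness of datapath (DatapathOk)
DatapathOk Impl Spec = [Formal definition illegible in this copy. Recoverable
structure: for every parcel, if every storage variable it reads in the implementation
equals (Definition 4) the corresponding variable read in the specification, then
every variable it writes in the implementation equals the corresponding variable
written in the specification.]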



3.6 Flushing Correctness Obligations 

Using Obligations 1-10, we have proved that every parcel that enters the pipeline and
should exit will produce the correct result (WriteOk in Figure 4). It may seem that
this is a sufficient definition of correctness; however, it allows externally visible state
variables that are written but never read to contain incorrect data. We solve this problem
with Obligations 11-13 (mathematical definitions appear elsewhere [2]). Obligation 11,
ImplWrFlush, is analogous to Obligation 9, ImplWrTotFun, except that it is concerned
with writes before flushed states, rather than writes before reads. Obligation 12,
SpecWrFlush, ensures that in a flushed implementation state, the last writes that happened in the
specification have corresponding writes in the implementation. Finally, Obligation 13,
ImplInvalidateFlush, ensures that for each specification variable, there is at least one
corresponding implementation variable. This is done by preventing the invalidation of
the last corresponding implementation variable.

Obligation 11 ImplWrFlush Impl Spec = Last visible writes in Impl before flushed
states correspond to writes in Spec.

Obligation 12 SpecWrFlush Impl Spec = Last visible writes in Spec occur in Impl.

Obligation 13 ImplInvalidateFlush Impl Spec = If the address map of a variable ($a_1$)
changes, then in the next clock cycle there must be another implementation variable
($a_2$) such that the address map of $a_2$ points to the same specification address as $a_1$
used to point to.



4 Proof That Hazard-Correctness Implies Burch-Dill Correctness

The proof that PipeOk implies Burch-Dill flushing (Theorem 1) contains four major 
steps that are linked by transitivity (Figure 4). In the first step, we used the correctness 
obligations for structural, control, and data hazards (Obligations 1-9) to prove that 
the read and write operations in the implementation obey data dependencies in the 
specification. That is, the operations exist and occur in the correct order (DataDepOk). In
the second step, we combined the ordering of data-storage operations with the correctness
of the datapath (Obligation 10) to prove that every write operation writes the correct data
(WriteOk). In the third step, we combined the correctness of write operations with the
correctness obligations for flushing (Obligations 11-13) to prove that when a pipeline
is in flushed state, it will correspond to the specification (FlushedEq). The definition of
FlushedEq comes from the Microbox work of Aagaard et al. [3], where it is identified by
the acronym iFEND for “informed-flushpoint with equality between a non-deterministic
implementation and a deterministic specification”.



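Definition 5 Burch-Dill correctness

BurchDillOk Impl Spec =

\[
\forall q_i, q_s, q_i'.\;
\bigl[\mathit{Flush}\ q_i = q_s \;\land\; \mathit{Impl}\ q_i\ q_i' \;\land\; \mathit{DoesFetch}\ q_i\ q_i'\bigr]
\implies
\exists q_s'.\; \bigl[\mathit{Flush}\ q_i' =_{\mathrm{EXT}} q_s' \;\land\; \mathit{Spec}\ q_s\ q_s'\bigr]
\]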
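To make the commuting diagram of Definition 5 concrete, the following minimal sketch checks it exhaustively on a toy design. Everything in it is an illustrative assumption rather than material from the paper: the specification is a counter modulo 4, the implementation is a one-stage pipeline whose state pairs the register value with one pending increment, plain equality stands in for equality on externally visible state, and both machines are deterministic, so the existential over the next specification state collapses to an equality test.

```python
MOD = 4  # size of the toy state space (assumption)

def spec_step(q_s):
    """Toy specification: one instruction increments the counter."""
    return (q_s + 1) % MOD

def impl_step(q_i):
    """Toy implementation: commit the pending increment, fetch a new one."""
    r, pending = q_i
    return ((r + pending) % MOD, 1)

def broken_impl_step(q_i):
    """Buggy variant that drops the pending increment instead of committing it."""
    r, _pending = q_i
    return (r, 1)

def does_fetch(q_i, q_i_next):
    """The toy pipeline fetches on every step."""
    return True

def flush(q_i):
    """Drain the pipeline: apply the pending increment to the register."""
    r, pending = q_i
    return (r + pending) % MOD

def burch_dill_ok(step):
    """Check Flush(q_i') == Spec(Flush(q_i)) for every implementation state."""
    for r in range(MOD):
        for pending in (0, 1):
            q_i = (r, pending)
            q_s = flush(q_i)
            q_i_next = step(q_i)
            if does_fetch(q_i, q_i_next) and flush(q_i_next) != spec_step(q_s):
                return False
    return True
```

Calling `burch_dill_ok(impl_step)` succeeds, while `burch_dill_ok(broken_impl_step)` fails on any state with a pending increment, which is the kind of lost write the flushing obligations are meant to exclude.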



Theorem 1 Pipeline correctness implies Burch-Dill correctness
PipeOkImpBurchDillOk =

\[ \forall\, \mathit{Impl}, \mathit{Spec}.\; \mathit{PipeOk}\ \mathit{Impl}\ \mathit{Spec} \implies \mathit{BurchDillOk}\ \mathit{Impl}\ \mathit{Spec} \]




5 Conclusions 

Several related efforts address correctness for pipelined circuits. Tahar and Kumar
defined correctness statements for the different types of hazards in a single-scalar, in-order
microprocessor [13]. Manolios has used bisimulation and retiming to relate the run of a
pipeline to a specification using state-based abstraction functions, such as flushing [9].
Mishra et al. defined correctness for pipelined microprocessors with the restriction that
instructions proceed from stage to stage in a lockstep order [11].

Some of the lemmas and decomposition strategies used by others are similar to 
correctness obligations in our work. McMillan’s inductive proof to show that each in- 
struction that reads correct data will write correct results [10] is similar to our obligation 
for datapath correctness. Sawada’s MAETT annotates implementation states with history
and prophecy variables to facilitate separating the effects of individual instructions [12].
This is similar in flavour to our use of read and write operations to identify the relevant
state variables for each instruction. Ho’s token networks [6] are a verification strategy
that might yield useful abstractions to verify our structural hazard obligations.

The goal of the work presented here was to establish a formal foundation for pipelined 
circuits that would increase verification capacity and productivity, be intuitive to both 
verification engineers and design engineers, and handle cutting-edge optimizations in 
pipelines. We have defined a formal model and correctness statement (PipeOk) based
upon conventional notions of stages, parcels, and hazards. We have proved that the 
correctness statement guarantees Burch-Dill flushing correctness. PipeOk is comprised 
of thirteen correctness obligations: three for structural hazards, six for data hazards, one 
for the datapath, and three for flushing. Control hazards are integrated into structural 
and data hazard correctness. The correctness obligations each deal with a specific type 
of behaviour, which should make them amenable to powerful abstraction and problem 
reduction techniques. We have begun several case studies to evaluate the effectiveness 
of PipeOk using a combination of model checking and theorem proving. After the case 
studies indicate that our model and correctness statement are effective, we will mechanize 
the proof that PipeOk implies Flushpoint Equality.

Abstract. We present a non-operational approach to specifying and
analyzing shared memory consistency models. The method uses higher 
order logic to capture a complete set of ordering constraints on execution 
traces, in an axiomatic style. A direct encoding of the semantics with a 
constraint logic programming language provides an interactive and incre- 
mental framework for exercising and verifying finite test programs. The 
framework has also been adapted to generate equivalent boolean satisfiability
(SAT) problems. These techniques make a memory model specification
executable, a powerful feature lacking in most non-operational
methods. As an example, we provide a concise formalization of the Intel 
Itanium memory model and show how constraint solving and SAT solv- 
ing can be effectively applied for computer aided analysis. Encouraging 
initial results demonstrate the scalability for complex industrial designs. 



1 Introduction 

Modern shared memory architectures rely on a rich set of memory access re- 
lated instructions to provide the flexibility needed by software. For instance, 
the Intel Itanium processor family [1] provides two varieties of loads and
stores in addition to fence and semaphore instructions, each associated with dif- 
ferent ordering restrictions. A memory model defines the underlying memory 
ordering semantics. Proper understanding of these ordering rules is essential for 
the correctness of shared memory consistency protocols that are aggressive in 
their ordering permissiveness, as well as for compiler transformations that rear- 
range multithreaded programs for higher performance. Due to the complexity 
of advanced computer architectures, however, practicing designers face a serious 
problem in reliably comprehending the memory model specification. 

* This work was supported by a grant from the Semiconductor Research Corporation
for Task 1031.001, and Research Grants CCR-0081406 and CCR-0219805 of NSF.

        P1              P2
    st a,1;         ld.acq r1,b;
    st.rel b,1;     ld r2,a;

Fig. 1. A litmus test showing the ordering properties of store-release and load-acquire.
Initially, a = b = 0. Can it result in r1 = 1 and r2 = 0? The Itanium memory model
does not permit this result.

Consider, for example, the assembly code shown in Fig. 1 that is run concurrently
on two Itanium processors (such code fragments are generally known
as litmus tests). The first processor, P1, executes a store of datum 1 into address
a; it then performs a store-release¹ of datum 1 into address b. Processor
P2 performs a load-acquire from b, loading the result into register r1. It is followed
by an ordinary load from location a into register r2. The question arises:
if all locations initially contain 0, can the final register values be r1=1 and r2=0?
To determine the answer, the Itanium memory model must be consulted. The
formal specification of the Itanium memory model is given in an Intel application
note [2]. It comprises a complex set of ordering rules, 24 of which are
expressed explicitly based on a large amount of special terminology. One can
follow a pencil-and-paper approach to reason that the execution shown in Fig. 1 is
not permitted by the rules specified in [2]. Based on this, one can conclude that
even though the instructions in P2 pertain to different addresses, the underlying
hardware is not allowed to carry out the ordinary load at the beginning, and by
the same token, a shared memory consistency protocol or an optimizing compiler
cannot reorder the instructions in P2. A further investigation shows that
the above result would be permitted if the st.rel in P1 is changed to a st, or
the ld.acq in P2 is changed to a ld. Therefore, st.rel and ld.acq must both
be used in pairs to achieve the “barrier” effect in this scenario.
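The effect of reordering the two loads in P2 can be illustrated with a small enumeration. The sketch below is an assumption-laden simplification and not the Itanium model: it treats every instruction as atomic, allows only interleavings that respect each processor's program order, and ignores the acquire and release annotations. Under that reading the result r1 = 1, r2 = 0 is unreachable, and it becomes reachable as soon as P2's loads are swapped.

```python
from itertools import combinations

# Each instruction is ("st", address, value) or ("ld", register, address).
P1 = [("st", "a", 1), ("st", "b", 1)]          # st a,1 ; st.rel b,1
P2 = [("ld", "r1", "b"), ("ld", "r2", "a")]    # ld.acq r1,b ; ld r2,a

def interleavings(p1, p2):
    """Yield every merge of p1 and p2 that keeps each program's own order."""
    total = len(p1) + len(p2)
    for slots in combinations(range(total), len(p1)):
        it1, it2 = iter(p1), iter(p2)
        yield [next(it1) if k in slots else next(it2) for k in range(total)]

def outcomes(p1, p2):
    """Return the set of final (r1, r2) values over all interleavings."""
    results = set()
    for trace in interleavings(p1, p2):
        mem = {"a": 0, "b": 0}      # all locations initially contain 0
        regs = {}
        for kind, x, y in trace:
            if kind == "st":
                mem[x] = y          # store value y to address x
            else:
                regs[x] = mem[y]    # load address y into register x
        results.add((regs["r1"], regs["r2"]))
    return results

in_order = outcomes(P1, P2)
loads_swapped = outcomes(P1, list(reversed(P2)))
```

Here `in_order` contains only (0, 0), (0, 1) and (1, 1), whereas `loads_swapped` also contains (1, 0), matching the observation that the ordinary load must not be carried out first.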

A litmus test like this can reveal critical information to help system designers
make the right decisions in code selection and optimization. But as bigger tests are
used and more intricate rules are involved, trace properties quickly become non- 
intuitive and hand-proving program compliance can be very difficult. How can 
one be assured that there does not exist an interacting rule that might introduce 
unexpected implications? Also, a large scale design is often composed of simpler 
components. To avoid being overwhelmed by the overall complexity, a useful 
technique is to isolate the rules related to specific architectural features so that 
the model can be analyzed piece by piece. For example, if one can selectively 
enable/disable certain rules, one may quickly find out that the “program order” 
rules in [2] are critical to the scenario in Fig. 1 while many others are irrelevant. 

These issues suggest that a series of useful features is needed from the specification
framework to help people better understand the underlying model. Unfortunately,
most non-operational specification methods leave these issues unresolved
because they use notations that do not support analysis through execution.
Given that designers need lucid and reliable memory model specifications,
and given that memory model specifications can live for decades, it is crucial
that progress be made in this regard.

¹ Briefly, a store-release instruction will, at its completion, ensure that all previous
instructions are completed; a load-acquire instruction correspondingly ensures that all
following instructions will complete only after it completes. These explanations are
far from precise - what do “previous” and “completion” mean? A formal specification
of a memory model is key to precisely capturing these and all similar notions.

In this paper, we take a fresh look at the non-operational specification method 
and explore what verification techniques can be applied. We make the following 
contributions. First, we present a compositional method to axiomatically capture
all aspects of the memory ordering requirements, resulting in a comprehensive,
constraint-based memory consistency model. Second, we propose a method to
encode these specifications using FD-Prolog.² This enables one to perform interactive
and incremental analysis. Third, we have harnessed a boolean satisfiability
checker to solve the constraints. To the best of our knowledge, this is the first 
application of SAT methods for analyzing memory model compliance. As a case 
study of this approach, we have formalized a core subset of the Itanium memory 
model and used constraint programming and boolean satisfiability for program 
analysis. 



Related Work. The area of memory model specification has been pursued 
under different approaches. Some researchers have employed operational style 
specifications [3] [4] [5] [6], in which the update of a global state is defined 
step-by-step with the execution of each instruction. For example, an operational 
model [4] for Sparc V9 [7] was developed in Murphi. With the model checking 
capability supported by Murphi, this executable model was used to examine 
many code sequences for Sparc V9. While the operational descriptions often 
mirror the decision process of an implementer and can be exploited by a model 
checker, they are not declarative. Hence they tend to emphasize the how aspects 
through their usage of specific data structures, not the what aspects that formal 
specifications are supposed to emphasize. 

Other researchers have used non-operational (also known as axiomatic) spec- 
ifications, in which the desired properties are directly defined. Non-operational 
styles have been widely used to describe conceptual memory models [8] [9]. One
noticeable limitation of these specifications is the lack of a means for automatic
execution. An axiomatic specification of the Alpha memory model was written
by Yu [10] in Lisp. Litmus tests were written as S-expressions. Verification
conditions were generated for the litmus tests and fed to the Simplify [11] verifier
of Compaq/SRC. In contrast, our specification is much closer to the actual in- 
dustrial specification, thanks to the declarative nature of FD-Prolog. The FD 
constraint solver offers a more interactive and incremental environment. We have 
also applied SAT and demonstrated its effectiveness. 

Lamport and colleagues have specified the Alpha and Itanium memory models
in TLA+ [12] [13]. These specifications build visibility order inductively and
support the execution of litmus tests. While their approach also precisely specifies
the ordering requirement, the manner in which such inductive definitions
are constructed will vary from memory model to memory model, making comparisons
among them harder. Our method instead relies on primitive relations
and directly describes the components that make up a full memory model. This
makes our specification easier to understand, and more importantly, to compare
against other memory models. This also means we can enable or disable some
sub-rules quite reliably without affecting the other primitive ordering rules - a
danger in a style which merges all the ordering concerns in a monolithic manner.

² FD-Prolog refers to Prolog with a finite domain (FD) constraint solver. For example,
SICStus Prolog and GNU Prolog have this feature.

Fig. 2. The process of making an axiomatic memory model executable. Legality of a
litmus test can be checked by either a constraint solver or a SAT solver.

Roadmap. In the next section, we introduce our methodology. Section 3 describes
the Itanium memory ordering rules. Section 4 demonstrates the analysis
of the Itanium memory model through execution. We conclude and propose
future work in Section 5. The concise specification of the Itanium ordering constraints
is provided in the Appendix, with additional details appearing at our
web site http://www.cs.utah.edu/formal_verification/itanium.

2 Overview of the Framework 

A pictorial representation of our methodology is shown in Fig. 2. We use a 
collection of primitive ordering rules, each serving a clear purpose, to specify 
even the most challenging commercial memory models. This approach mirrors 
the style adopted in modern declarative specifications written by the industry, 
such as [2]. Moreover, by using pure logic programs supported by certain modern
flavors of Prolog that also include finite domain constraints, one can directly 
capture these higher order logic specifications and also interactively execute the 
specifications to obtain execution results for litmus tests. Alternatively, we can 
obtain SAT instances of the boolean constraints representing the memory model 
through symbolic execution, in which case boolean satisfiability tools can be 
employed to quickly answer whether the tests are legal or not. 

2.1 Specification Method 

To define a memory model, we use predicate calculus to specify all constraints
imposed on an ordering relation order. The constraints are almost completely
first-order. However, since order is a parameter to the specification, the constraints
are most easily captured with higher order predicate calculus (we use the
HOL logic [14]). Previous non-operational specifications often implicitly require
general ordering properties, such as totality, transitivity, and circuit-freeness.
This is the main reason why such specifications cannot readily be executed.
In contrast, we are fully explicit about such properties, and so our constraints
completely characterize the memory model.

The flexibility of our notation allows us to specify different memory models 
under the same framework. We have assembled a large collection of constraints 
for many conventional memory models, such as Sequential Consistency [8], Co- 
herence, Causal Consistency [9], PRAM [15], and Processor Consistency [16]. 
Due to space limitations, this paper concentrates on demonstrating how to specify
and analyze the Itanium memory ordering rules.



2.2 Executing Axiomatic Specifications 

A straightforward transcription of the formal predicate calculus specification
into a Prolog-style logic program makes interactive and incremental
execution of litmus tests possible. This encourages exploration and experiment
in the development and validation of complex coherence protocols. To make a
specification executable, we instantiate it over a finite execution and convert the 
verification problem to a satisfiability problem. 

The Algorithm. Given a finite execution ops with $n$ operations, there are $n^2$
ordering pairs, constituting an ordering matrix $M$, where the element $M_{ij}$ indicates
whether operations $i$ and $j$ should be ordered. We go through each ordering
rule in the specification and impose the corresponding constraint on the
elements of $M$. Then we check the satisfiability of all the ordering requirements.
If such an $M$ exists, the trace ops is legal, and a valid interleaving can be derived
from $M$. Otherwise, ops is not a legal trace.
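The algorithm can be sketched as an exhaustive search over candidate ordering matrices. This is a minimal illustration under stated assumptions, not the authors' Prolog implementation: the names `require_linear_order`, `require_pairs` and `legal` are hypothetical, the linear-order rule is a simplified reading of the general ordering requirement (irreflexive, asymmetric, total on distinct pairs, transitive), and brute force is only feasible for very small n.

```python
from itertools import product

def require_linear_order(n, M):
    """Rule: M is irreflexive, asymmetric, total on distinct pairs, transitive."""
    for i in range(n):
        if M[i][i]:
            return False
        for j in range(n):
            if i != j and M[i][j] == M[j][i]:
                return False  # both set (not asymmetric) or neither set (not total)
            for k in range(n):
                if M[i][j] and M[j][k] and not M[i][k]:
                    return False  # transitivity violated
    return True

def require_pairs(pairs):
    """Build a rule demanding that operation i is ordered before j for each pair."""
    return lambda n, M: all(M[i][j] for i, j in pairs)

def legal(n, rules):
    """Return an ordering matrix satisfying every rule, or None if none exists."""
    for bits in product((0, 1), repeat=n * n):
        M = [list(bits[r * n:(r + 1) * n]) for r in range(n)]
        if all(rule(n, M) for rule in rules):
            return M
    return None

chain = legal(3, [require_linear_order, require_pairs([(0, 1), (1, 2)])])
cycle = legal(3, [require_linear_order, require_pairs([(0, 1), (1, 2), (2, 0)])])
```

Because each rule is a separate function, individual rules can be dropped from the list passed to `legal`, which mirrors the idea of selectively enabling or disabling ordering rules.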

Applying Constraint Logic Programming. Logic programming differs from 
conventional programming in that it describes the logical structure of the prob- 
lems rather than prescribing the detailed steps of solving them. This naturally 
reflects the philosophy of the axiomatic specification style. As a result, our formal 
specification can be easily encoded using Prolog. Memory ordering constraints 
can be solved through a conjunction of two mechanisms that FD-Prolog read- 
ily provides. One applies backtracking search for all constraints expressed by 
logical variables, and the other uses non-backtracking constraint solving based 
on arc consistency [17] for FD variables, which is potentially more efficient and 
certainly more complete (especially under the presence of negation) than with 
logical variables. This works by adding constraints in a monotonically increasing
manner to a constraint store, with the built-in constraint propagation rules of
FD-Prolog helping refine the variable ranges (or concluding that the constraints
are not satisfiable) when constraints are asserted to the constraint store.

Applying Boolean Satisfiability Techniques. The goal of a boolean satisfiability
problem is to determine a satisfying variable assignment for a boolean
formula or to conclude that no such assignment exists. A slight variant of the
Prolog code can let us benefit from SAT solving techniques, which have advanced
tremendously in recent years. Instead of solving constraints using an FD solver,
we can let Prolog emit SAT instances through symbolic execution. The resultant 
formula is true if and only if the litmus test is legal under the memory model. 
It is then sent to a SAT solver to find out the result. 
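As an illustration of what such a SAT instance might look like, the sketch below encodes only the linear-order requirement on the ordering matrix as CNF clauses over one boolean variable per matrix element. The encoding, the helper names (`var`, `linear_order_cnf`, `satisfiable`) and the exhaustive checker are assumptions made for this example; the paper's actual encoding is generated from its Prolog code and would be passed to a real SAT solver.

```python
from itertools import permutations, product

def var(i, j, n):
    """1-based variable number meaning 'operation i is ordered before j'."""
    return i * n + j + 1

def linear_order_cnf(n):
    """Clauses (lists of signed ints) forcing the matrix to be a linear order."""
    clauses = []
    for i in range(n):
        clauses.append([-var(i, i, n)])                       # irreflexive
    for i in range(n):
        for j in range(i + 1, n):
            clauses.append([var(i, j, n), var(j, i, n)])      # total
            clauses.append([-var(i, j, n), -var(j, i, n)])    # asymmetric
    for i, j, k in permutations(range(n), 3):
        clauses.append([-var(i, j, n), -var(j, k, n), var(i, k, n)])  # transitive
    return clauses

def satisfiable(clauses, num_vars):
    """Naive exhaustive check, standing in for a real SAT solver."""
    for bits in product((False, True), repeat=num_vars):
        if all(any(bits[abs(lit) - 1] == (lit > 0) for lit in clause)
               for clause in clauses):
            return True
    return False

n = 3
base = linear_order_cnf(n)
# Unit clauses demanding the cyclic order 0 < 1 < 2 < 0, which no linear order allows.
cyclic = base + [[var(0, 1, n)], [var(1, 2, n)], [var(2, 0, n)]]
```

The base formula is satisfiable, corresponding to a legal execution, while the cyclic variant is unsatisfiable, corresponding to an illegal one.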

3 Specifying the Itanium Memory Consistency Model 

The original Itanium memory ordering specification is informally given in various
places in the Itanium architecture manual [1]. Intel later provided an application
note [2] to guide system developers. This document uses a combination of
English and informal mathematics to specify a core subset of memory operations
in a non-operational style. We demonstrate how the specification of [2]
can be adapted to our framework to enable computer aided analysis. Virtually
the entire Intel application note has been captured.³ We assume proper address
alignment and common address size for all memory accesses, which would be the
common case encountered by programmers (even these restrictions could be easily
lifted). The detailed definition of the Itanium memory model is presented in
the Appendix. This section explains each of the rules. The following definitions
are used throughout this paper:

Instructions: Instructions with memory access or memory ordering semantics.
Five instruction types are defined in this paper: load-acquire (ld.acq),
store-release (st.rel), unordered load (ld), unordered store (st), and memory
fence (mf). An instruction i may have read semantics (isRd i = true) or write
semantics (isWr i = true). The ld.acq and ld instructions have read semantics,
st.rel and st have write semantics, and mf has neither read nor write semantics.
Instructions are decomposed into operations to allow a finer specification of the
ordering properties.

Execution: Also known as an execution trace, it contains all memory operations
generated by a program. Stores are annotated with the write data and loads are 
annotated with the return data. An execution is legal if there exists an order 
among the operations that satisfies all memory model constraints. 

Address Attributes: Every memory location is associated with an address
attribute, which can be write-back (WB), uncacheable (UC), or write-coalescing 
(WC). Memory ordering semantics may vary for different attributes. Predicate 
attribute is used to find the attribute of a location. 

³ This paper formally captures 21 out of 24 rules from [2]. The remaining 3 rules
deal with semaphore operations, which are straightforward to add using the same
approach.




Analyzing the Intel Itanium Memory Ordering Rules 






Operation Tuple: A tuple containing necessary attributes is used to mathematically
describe memory operations. A memory operation i is represented by
a tuple (P, Pc, Op, Var, Data, WrId, WrType, WrProc, Reg, UseReg, Id), where

    p i = P :            issuing processor
    pc i = Pc :          program counter
    op i = Op :          instruction type
    var i = Var :        shared memory location
    data i = Data :      data value
    wrID i = WrId :      identifier of a write operation
    wrType i = WrType :  type of a write operation
    wrProc i = WrProc :  target processor observing a write operation
    reg i = Reg :        register
    useReg i = UseReg :  flag of a write indicating if it uses a register
    id i = Id :          global identifier of the operation



A read instruction or a fence instruction is decomposed into a single operation.
A write instruction is decomposed into multiple operations, comprising
a local write operation (wrType i = Local) and a set of remote write operations
(wrType i = Remote), one for each target processor (wrProc i), including
the issuing processor. All write operations i that originate from a
single write instruction share the same program counter (pc i) and write ID
(wrID i).
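The operation tuple and the decomposition of a write instruction can be sketched as a small data structure. The field names follow the tuple above, but the Python types, the helper `decompose_store`, the way identifiers are assigned, and the choice to leave `wr_proc` empty for the local write are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class Operation:
    p: int                    # issuing processor
    pc: int                   # program counter
    op: str                   # instruction type, e.g. "st" or "st.rel"
    var: Optional[str]        # shared memory location
    data: Optional[int]       # data value
    wr_id: Optional[int]      # identifier of a write operation
    wr_type: Optional[str]    # "Local" or "Remote"
    wr_proc: Optional[int]    # target processor observing a write operation
    reg: Optional[str]        # register
    use_reg: bool             # whether the write uses a register
    id: int                   # global identifier of the operation

def decompose_store(p, pc, op, var, data, wr_id, procs, first_id) -> List[Operation]:
    """Split one store into a local write plus one remote write per processor."""
    # Assumption: the local write carries no target processor.
    local = Operation(p, pc, op, var, data, wr_id, "Local", None, None, False, first_id)
    remotes = [
        Operation(p, pc, op, var, data, wr_id, "Remote", q, None, False, first_id + 1 + k)
        for k, q in enumerate(procs)   # procs includes the issuing processor
    ]
    return [local] + remotes

# The store "st a,1" issued by processor 1 in a two-processor system.
ops = decompose_store(p=1, pc=0, op="st", var="a", data=1, wr_id=0,
                      procs=[1, 2], first_id=1)
```

For two processors this yields three operations per store (one local, two remote), all sharing the same program counter and write ID.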



3.1 The Itanium Memory Ordering Rules 

As shown below, predicate legal is a top-level constraint that defines the 
legality of a trace ops by checking the existence of an order among ops that 
satisfies all requirements. Each requirement is formally defined in the Appendix. 

legal ops ≡ ∃ order.
    requireLinearOrder ops order ∧
    requireWriteOperationOrder ops order ∧
    requireProgramOrder ops order ∧
    requireMemoryDataDependence ops order ∧
    requireDataFlowDependence ops order ∧
    requireCoherence ops order ∧
    requireReadValue ops order ∧
    requireAtomicWBRelease ops order ∧
    requireSequentialUC ops order ∧
    requireNoUCBypass ops order

Table 1 illustrates the hierarchy of the Itanium memory model definition.
Most constraints strictly follow the rules from [2]. We also explicitly add a predicate
requireLinearOrder to capture the general ordering requirement since [2]
has only English to convey this important ordering property.

Table 1. The specification hierarchy of the Itanium memory ordering rules.

requireLinearOrder            requireMemoryDataDependence   requireReadValue
 - requireWeakTotal            - MD:RAW                      - validWr
 - requireTransitive           - MD:WAR                      - validLocalWr
 - requireAsymmetric           - MD:WAW                      - validRemoteWr
                                                             - validDefaultWr
requireWriteOperationOrder    requireDataFlowDependence      - validRd
 - local/remote case           - DF:RAR
 - remote/remote case          - DF:RAW                     requireNoUCBypass
                               - DF:WAR
requireProgramOrder                                         requireSequentialUC
 - acquire case               requireCoherence               - RAR case
 - release case                - local/local case            - RAW case
 - fence case                  - remote/remote case          - WAR case
                              requireAtomicWBRelease         - WAW case


General Ordering Requirement (Appendix A.1). This requires order to
be a weak total order which is also circuit-free.

Write Operation Order (Appendix A.2). This specifies the ordering among
write operations that originate from a single write instruction. It guarantees that
no write can become visible remotely before it becomes visible locally.

Program Order (Appendix A.3). This restricts reordering among instructions
of the same processor with respect to the program order.

Memory-Data Dependence (Appendix A.4). This restricts reordering
among instructions from the same processor when they access common locations.

Data-Flow Dependence (Appendix A.5). This is intended to specify how
local data dependency and control dependency should be treated. However, this
is an area that is not fully specified in [2]. Instead of pointing to an informal
document as done in [2], we provide a formal specification covering most cases
of data dependency, namely establishing data dependency between two memory
operations by checking for conflicting uses of local registers.⁴

Coherence (Appendix A.6). This constrains the order of writes to a common
location. If two writes to the same location with the attribute of WB or UC become
visible to a processor in some order, they must become visible to all processors
in that order.

⁴ We do not cover branch instructions or indirect-mode instructions that also induce
data dependency. We provide enough data dependency specification to let designers
experiment with straight-line code that uses registers - this is an important requirement
to support execution.

        P1                              P2
(1) st_local(a,1);              (7) ld.acq(1,b);
(2) st_remote1(a,1);            (8) ld(0,a);
(3) st_remote2(a,1);
(4) st.rel_local(b,1);
(5) st.rel_remote1(b,1);
(6) st.rel_remote2(b,1);

Fig. 3. An execution resulting from the program in Fig. 1. Stores are decomposed into
local stores and remote stores. Loads are associated with return values.



Read Value (Appendix A.7). This defines what data can be observed by 
a read operation. There are three scenarios: a read can get the data from a 
local write (validLocalWr), a remote write (validRemoteWr), or the default 
value (validDefaultWr). Similar to shared memory read value rules, predicate 
validRd guarantees consistent assignments of registers: the value of a register 
is obtained from the most recent previous assignment of the same register. 

Total Ordering of WB Releases (Appendix A.8). This specifies that store- 
releases to write-back (WB) memory must obey remote write atomicity, i.e., 
they become remotely visible atomically. 

Sequentiality of UC Operations (Appendix A.9). This specifies that 
operations to uncacheable (UC) memory locations must have the property of 
sequentiality, i.e., they must become visible in program order. 

No UC Bypassing (Appendix A.10). This specifies that uncacheable (UC) 
memory does not allow local bypassing from UC writes. 



4 Making the Itanium Memory Model Executable 

We have developed two methods to analyze the Itanium memory model. The 
first, as mentioned earlier, uses Prolog backtracking search, augmented with 
finite-domain constraint solving. The second approach targets the powerful SAT 
engines that have recently emerged. 

The Constraint Logic Programming Approach 

Our formal Itanium specification is implemented in SICStus Prolog [18]. Litmus 
tests are contained in a separate test file. When a test number is selected, the FD 
constraint solver examines all constraints automatically and answers whether the 
selected execution is legal. By running the litmus tests we can learn the degree 
to which executions are constrained, i.e., we can obtain a general view of the 
global ordering relation between pairs of instructions. 

        1  2  3  4  5  6  7  8

   1    0  1  1  0  0  0  0  0
   2    0  0  1  0  0  0  0  0
   3    0  0  0  0  0  0  0  0
   4    1  1  1  0  1  1  1  0
   5    1  1  1  0  0  1  1  0
   6    1  1  1  0  0  0  1  0
   7    1  1  1  0  0  0  0  0
   8    1  1  1  1  1  1  1  0

Fig. 4. A legal ordering matrix for the execution shown in Fig. 3 when 
requireProgramOrder is disabled. A value 1 indicates that the two operations are 
ordered. A possible interleaving 8 4 5 6 7 1 2 3 is also automatically derived from 
this matrix. 
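
As an illustration of how an interleaving can be read off such a matrix, here is a minimal Python sketch. It is not the authors' Prolog code. It assumes that entry (i, j) equal to 1 means operation i is ordered before operation j, and the names MATRIX and interleaving are introduced here only for the example.

```python
# Ordering matrix transcribed from Fig. 4 (assumption of this sketch:
# MATRIX[i][j] == 1 means operation i+1 is ordered before operation j+1).
MATRIX = [
    [0, 1, 1, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 1, 1, 1, 0],
    [1, 1, 1, 0, 0, 1, 1, 0],
    [1, 1, 1, 0, 0, 0, 1, 0],
    [1, 1, 1, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 0],
]


def interleaving(matrix):
    """Return the operation numbers in the order the matrix prescribes.

    In a strict total order on n operations, the operation at position p
    (0-based) precedes exactly n - 1 - p others, so sorting the rows by
    descending row sum recovers the interleaving.
    """
    n = len(matrix)
    order = sorted(range(n), key=lambda i: -sum(matrix[i]))
    # Check that the matrix really is a strict total order.
    for pos, i in enumerate(order):
        for j in order[pos + 1:]:
            if not matrix[i][j] or matrix[j][i]:
                raise ValueError("matrix is not a strict total order")
    return [i + 1 for i in order]


print(interleaving(MATRIX))
```

For the matrix above, the row sums are all distinct, which is what makes the order unique.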



Consider, for example, the program discussed earlier in Fig. 1. Its instruc- 
tions are decomposed into operations as shown in Fig. 3. After taking this trace 
as input, the Prolog tool attempts all possible orders until it can find an instanti- 
ation that satisfies all constraints. For this particular example, it returns “illegal 
trace” as the result. If one comments out the requireProgramOrder rule and 
examines the trace again, the tool quickly finds a legal ordering matrix and a 
corresponding interleaving as shown in Fig. 4. Many other experiments can be 
conveniently performed in a similar way. Therefore, not only does this approach 
give people the notation to write rigorous as well as readable specifications, it 
also allows users to play with the model, asking “what if” queries after selec- 
tively enabling/disabling the ordering rules that are crucial to their work. We 
can also use the built-in predicate setof provided by Prolog to collect all legal 
return values for read operations. This is achieved by repeatedly backtracking 
and gradually building up a list of the solutions. 

Although translating the formal specification to Prolog is fairly straightfor- 
ward, there does exist some “logic gap” between predicate calculus and Prolog. 
Most Prolog systems do not directly support quantifiers. Therefore, we need to 
implement the effect of a universal quantifier by enumerating the related finite 
domain. The existential quantifier is realized by the backtracking mechanism of 
Prolog when proper predicate conditions are set. 
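
The search described above can be imitated in a few lines of Python. This is only a sketch under simplifying assumptions, not the SICStus implementation: it covers just the Write Operation Order rule, the acquire and release parts of Program Order, and a reduced Read Value rule in which a load sees only local writes of its own processor and remote writes addressed to its own processor. The Op record, the reading of st_remoteN as "remote store addressed to processor N", and the default value 0 are assumptions made for the example. Universal quantifiers become enumeration over the finite set of operations, and the existential quantifier over orders becomes a search for a witness.

```python
from collections import namedtuple
from itertools import permutations

# wr_type and wr_proc are None for loads (hypothetical encoding).
Op = namedtuple("Op", "id proc pc kind var data wr_type wr_proc")

OPS = [  # the execution of Fig. 3
    Op(1, 1, 1, "st", "a", 1, "local", 1),
    Op(2, 1, 1, "st", "a", 1, "remote", 1),
    Op(3, 1, 1, "st", "a", 1, "remote", 2),
    Op(4, 1, 2, "st.rel", "b", 1, "local", 1),
    Op(5, 1, 2, "st.rel", "b", 1, "remote", 1),
    Op(6, 1, 2, "st.rel", "b", 1, "remote", 2),
    Op(7, 2, 1, "ld.acq", "b", 1, None, None),
    Op(8, 2, 2, "ld", "a", 0, None, None),
]
DEFAULT = 0  # assumed initial value of every location


def is_wr(o):
    return o.wr_type is not None


def by_write_operation(i, j):
    # Same store instruction: local part first, then own remote, then others.
    same_store = is_wr(i) and is_wr(j) and (i.proc, i.pc) == (j.proc, j.pc)
    return same_store and (
        (i.wr_type == "local" and j.wr_type == "remote" and j.wr_proc == i.proc)
        or (i.wr_type == "remote" and j.wr_type == "remote"
            and i.wr_proc == i.proc and j.wr_proc != i.proc))


def by_program(i, j):
    return i.proc == j.proc and i.pc < j.pc


def by_acquire(i, j):
    return by_program(i, j) and i.kind == "ld.acq"


def by_release(i, j):
    if not (by_program(i, j) and j.kind == "st.rel"):
        return False
    return (not is_wr(i)
            or i.wr_type == j.wr_type == "local"
            or (i.wr_type == j.wr_type == "remote" and i.wr_proc == j.wr_proc))


def read_value_ok(pos, j):
    """Reduced Read Value rule for load j under the total order pos."""
    def latest(kind):
        ws = [w for w in OPS
              if is_wr(w) and w.var == j.var and w.wr_type == kind
              and (w.proc if kind == "local" else w.wr_proc) == j.proc
              and pos[w.id] < pos[j.id]]
        return max(ws, key=lambda w: pos[w.id], default=None)
    local, remote = latest("local"), latest("remote")
    if local is None and remote is None:
        return j.data == DEFAULT
    return any(w is not None and w.data == j.data for w in (local, remote))


def ordering_pairs(use_program_order):
    rules = [by_write_operation]
    if use_program_order:
        rules += [by_acquire, by_release]
    # "For all i, j": enumerate the finite domain of operation pairs.
    return [(i.id, j.id) for i in OPS for j in OPS
            if any(rule(i, j) for rule in rules)]


def is_legal(interleaving, pairs):
    pos = {op_id: n for n, op_id in enumerate(interleaving)}
    return (all(pos[i] < pos[j] for i, j in pairs)
            and all(read_value_ok(pos, j) for j in OPS if not is_wr(j)))


def find_legal(use_program_order=True):
    # "There exists an order": search for a witness, as backtracking would.
    pairs = ordering_pairs(use_program_order)
    ids = [o.id for o in OPS]
    return next((list(p) for p in permutations(ids) if is_legal(p, pairs)), None)


print(find_legal(True))   # program-order rules enabled
print(find_legal(False))  # program-order rules disabled
```

Under these assumptions the first call finds no order, because the release, read-value, and acquire constraints form a cycle through operations 3, 6, 7, and 8, while the second call finds a witness. Exhaustive enumeration is used only for brevity; a constraint solver prunes the search far more aggressively.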



The SAT Approach 

As an alternative method, we use our Prolog program as a driver to emit 
propositional formulae asserting the solvability of the constraints. After being 
converted to the DIMACS format, the final formula is sent to a SAT solver, such as 
ZChaff [19] or BerkMin [20]. Although the clause generation phase can be detached 
from the logic programming approach, the ability to have it coexist with FD-Prolog 
might be advantageous since it allows the two methods to share the same 
specification base. 
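
To make the idea of a propositional encoding concrete, the following Python sketch emits DIMACS clauses for the General Ordering Requirement of Appendix A.1 only (weak totality, transitivity, asymmetry). It is an illustration and not the authors' generator: the variable numbering and the function names linear_order_cnf and to_dimacs are assumptions of the example, and the remaining rules of the specification are omitted.

```python
from itertools import permutations


def linear_order_cnf(n):
    """Return (num_vars, clauses) forcing order(i, j) to be a strict total order.

    One propositional variable per ordered pair (i, j); DIMACS variables
    are numbered from 1.
    """
    def var(i, j):
        return i * n + j + 1

    clauses = []
    for i in range(n):
        clauses.append([-var(i, i)])                      # asymmetry with i = j
        for j in range(i + 1, n):
            clauses.append([var(i, j), var(j, i)])        # weak totality
            clauses.append([-var(i, j), -var(j, i)])      # asymmetry
    for i, j, k in permutations(range(n), 3):
        clauses.append([-var(i, j), -var(j, k), var(i, k)])  # transitivity
    return n * n, clauses


def to_dimacs(num_vars, clauses):
    """Serialize the clauses in DIMACS CNF format."""
    lines = [f"p cnf {num_vars} {len(clauses)}"]
    lines += [" ".join(map(str, c)) + " 0" for c in clauses]
    return "\n".join(lines) + "\n"


num_vars, clauses = linear_order_cnf(3)
print(to_dimacs(num_vars, clauses))
```

Transitivity is generated only for pairwise distinct i, j, k, since the remaining instances are either tautologies or already implied by asymmetry. Each satisfying assignment of the resulting formula corresponds to one linear order of the operations.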



Performance Results 

Performance statistics from some litmus tests are shown below. These tests are 
chosen from [2] and represented by their original table numbers. Results are 
measured on a Pentium III 900 MHz machine with 256 MB of RAM running 
Windows 2000. SICStus Prolog is run under compiled mode. The SAT solver 
used is ZChaff. 



Test            Result    FD Solver (sec)   Vars   Clauses   SAT (sec)   CNF Gen (sec)

[2, Table 5]    illegal         0.49          64       679      0.01           3.67
[2, Table 10]   legal           3.00         100      1280      0.01           8.23
[2, Table 15]   illegal        22.29         576     15706      0.01         211.76
[2, Table 18]   illegal         2.40         144      2125      0.01          15.75
[2, Table 19]   legal           4.81         144      2044      0.01          15.68


Although satisfiability problems are NP-complete, the performance in prac- 
tice has been acceptable. For the method using SAT solvers, the clause genera- 
tion time is noticeably larger than the actual SAT solving time, since the entire 
formula is encoded at once through symbolic execution and is recursively sim- 
plified afterwards. Alternative boolean formula encoding techniques, such as the 
one discussed in [21], may help speed up this process. 



5 Conclusions 

The setting in which contemporary memory models are expressed and analyzed 
needs to be improved. Towards this, we present a framework based on axiomatic 
specifications (expressed in higher order logic) of memory ordering requirements. 
It is straightforward to encode these requirements as constraint logic programs 
or, by an extra level of translation, as boolean satisfiability problems. Our tech- 
niques are demonstrated through the adaptation and analysis of the Itanium 
memory model. Being able to tackle such a complex design also attests to the 
scalability of our framework for cutting-edge commercial architectures. 

Our methodology provides several benefits. First, the ability to execute the 
underlying model is a powerful feature that promotes understanding. Second, the 
compositional specification style provides modularity, reusability, and scalability. 
It also allows one to change constraints incrementally for investigation purposes. 
Third, the expressive power of the underlying logic allows one to define a wide 
range of requirements using the same notation, providing a rich taxonomy for 
memory consistency models. Finally, the method of converting axiomatic rules to 
a propositional formula allows one to perform property checking through boolean 
reasoning, thus opening up a new means to conduct memory model verification. 

Future Work. One possible enhancement is to develop the capability of exercising 
symbolic (non-ground) litmus tests. Such a tool may be used to automatically 
synthesize critical instructions of concurrent code fragments comprising compiler 
idioms or other synchronization primitives. For example, one could imagine us- 
ing a symbolic store instruction in a program and asking a tool to solve whether 
it should be an “ordinary” or a “release” store to help generate aggressive code. 
Another area of improvement is in reducing the logic gap between the formal 
specification and the tools that execute the specification. One possibility is to 
apply a quantified boolean formulae (QBF) solver that directly accepts 
quantifiers. The research of QBF solvers is still at a preliminary stage compared to 
propositional SAT. We hope our work can help accelerate its development by 
providing industrially motivated benchmarks. 
Appendix: Formal Itanium Memory Ordering Specification 

legal ops ≡ ∃ order. 
    requireLinearOrder ops order ∧ requireWriteOperationOrder ops order ∧ 
    requireProgramOrder ops order ∧ 
    requireMemoryDataDependence ops order ∧ 
    requireDataFlowDependence ops order ∧ requireCoherence ops order ∧ 
    requireReadValue ops order ∧ requireAtomicWBRelease ops order ∧ 
    requireSequentialUC ops order ∧ requireNoUCBypass ops order 



A.1 General Ordering Requirement 

requireLinearOrder ops order ≡ 
    requireWeakTotal ops order ∧ requireTransitive ops order ∧ 
    requireAsymmetric ops order 

requireWeakTotal ops order ≡ ∀ i, j ∈ ops. id i ≠ id j ⇒ (order i j ∨ order j i) 

requireTransitive ops order ≡ ∀ i, j, k ∈ ops. (order i j ∧ order j k) ⇒ order i k 

requireAsymmetric ops order ≡ ∀ i, j ∈ ops. order i j ⇒ ¬(order j i) 

A.2 Write Operation Order 

requireWriteOperationOrder ops order ≡ ∀ i, j ∈ ops. 
    orderedByWriteOperation i j ⇒ order i j 

orderedByWriteOperation i j ≡ isWr i ∧ isWr j ∧ wrID i = wrID j ∧ 
    (wrType i = Local ∧ wrType j = Remote ∧ wrProc j = p i ∨ 
     wrType i = Remote ∧ wrType j = Remote ∧ 
     wrProc i = p i ∧ wrProc j ≠ p i) 



A.3 Program Order 

requireProgramOrder ops order ≡ ∀ i, j ∈ ops. 
    (orderedByAcquire i j ∨ orderedByRelease i j ∨ orderedByFence i j) ⇒ 
    order i j 

orderedByProgram i j ≡ p i = p j ∧ pc i < pc j 

orderedByAcquire i j ≡ orderedByProgram i j ∧ op i = ld.acq 

orderedByRelease i j ≡ orderedByProgram i j ∧ op j = st.rel ∧ 
    (isWr i ⇒ (wrType i = Local ∧ wrType j = Local ∨ 
               wrType i = Remote ∧ wrType j = Remote ∧ wrProc i = wrProc j)) 

orderedByFence i j ≡ orderedByProgram i j ∧ (op i = mf ∨ op j = mf) 

A.4 Memory-Data Dependence 

requireMemoryDataDependence ops order ≡ ∀ i, j ∈ ops. 
    (orderedByRAW i j ∨ orderedByWAR i j ∨ orderedByWAW i j) ⇒ 
    order i j 

orderedByMemoryData i j ≡ orderedByProgram i j ∧ var i = var j 

orderedByRAW i j ≡ 
    orderedByMemoryData i j ∧ isWr i ∧ wrType i = Local ∧ isRd j 

orderedByWAR i j ≡ 
    orderedByMemoryData i j ∧ isRd i ∧ isWr j ∧ wrType j = Local 

orderedByWAW i j ≡ orderedByMemoryData i j ∧ isWr i ∧ isWr j ∧ 
    (wrType i = Local ∧ wrType j = Local ∨ 
     wrType i = Remote ∧ wrType j = Remote ∧ 
     wrProc i = p i ∧ wrProc j = p i) 



A.5 Data-Flow Dependence 

requireDataFlowDependence ops order ≡ ∀ i, j ∈ ops. 
    orderedByLocalDepencence i j ⇒ order i j 

orderedByLocalDepencence i j ≡ orderedByProgram i j ∧ reg i = reg j ∧ 
    (isRd i ∧ isRd j ∨ 
     isWr i ∧ wrType i = Local ∧ useReg i ∧ isRd j ∨ 
     isRd i ∧ isWr j ∧ wrType j = Local ∧ useReg j) 

A.6 Coherence 

requireCoherence ops order ≡ ∀ i, j ∈ ops. 
    (isWr i ∧ isWr j ∧ var i = var j ∧ 
     (attribute (var i) = WB ∨ attribute (var i) = UC) ∧ 
     (wrType i = Local ∧ wrType j = Local ∧ p i = p j ∨ 
      wrType i = Remote ∧ wrType j = Remote ∧ wrProc i = wrProc j) ∧ 
     order i j) ⇒ 
    (∀ p, q ∈ ops. 
     (isWr p ∧ isWr q ∧ wrID p = wrID i ∧ wrID q = wrID j ∧ 
      wrType p = Remote ∧ wrType q = Remote ∧ wrProc p = wrProc q) ⇒ 
     order p q) 



A.7 Read Value 

requireReadValue ops order ≡ ∀ j ∈ ops. 
    (isRd j ⇒ (validLocalWr ops order j ∨ validRemoteWr ops order j ∨ 
               validDefaultWr ops order j)) ∧ 
    ((isWr j ∧ useReg j) ⇒ validRd ops order j) 

validLocalWr ops order j ≡ ∃ i ∈ ops. 
    (isWr i ∧ wrType i = Local ∧ var i = var j ∧ p i = p j ∧ 
     data i = data j ∧ order i j) ∧ 
    (¬∃ k ∈ ops. isWr k ∧ wrType k = Local ∧ var k = var j ∧ p k = p j ∧ 
     order i k ∧ order k j) 

validRemoteWr ops order j ≡ ∃ i ∈ ops. 
    (isWr i ∧ wrType i = Remote ∧ wrProc i = p j ∧ var i = var j ∧ 
     data j = data i ∧ ¬(order j i)) ∧ 
    (¬∃ k ∈ ops. isWr k ∧ wrType k = Remote ∧ var k = var j ∧ wrProc k = p j ∧ 
     order i k ∧ order k j) 

validDefaultWr ops order j ≡ 
    (¬∃ i ∈ ops. isWr i ∧ var i = var j ∧ order i j ∧ 
     (wrType i = Local ∧ p i = p j ∨ wrType i = Remote ∧ wrProc i = p j)) ∧ 
    data j = default (var j) 

validRd ops order j ≡ ∃ i ∈ ops. 
    (isRd i ∧ reg i = reg j ∧ orderedByProgram i j ∧ data j = data i) ∧ 
    (¬∃ k ∈ ops. isRd k ∧ reg k = reg j ∧ 
     orderedByProgram i k ∧ orderedByProgram k j) 



A.8 Total Ordering of WB Releases 

requireAtomicWBRelease ops order ≡ ∀ i, j, k ∈ ops. 
    (op i = st.rel ∧ wrType i = Remote ∧ op k = st.rel ∧ wrType k = Remote ∧ 
     wrID i = wrID k ∧ attribute (var i) = WB ∧ order i j ∧ order j k) ⇒ 
    (op j = st.rel ∧ wrType j = Remote ∧ wrID j = wrID i) 

A.9 Sequentiality of UC Operations 

requireSequentialUC ops order ≡ ∀ i, j ∈ ops. orderedByUC i j ⇒ order i j 

orderedByUC i j ≡ 
    orderedByProgram i j ∧ attribute (var i) = UC ∧ attribute (var j) = UC ∧ 
    (isRd i ∧ isRd j ∨ 
     isRd i ∧ isWr j ∧ wrType j = Local ∨ 
     isWr i ∧ wrType i = Local ∧ isRd j ∨ 
     isWr i ∧ wrType i = Local ∧ isWr j ∧ wrType j = Local) 



A.10 No UC Bypassing 

requireNoUCBypass ops order ≡ ∀ i, j, k ∈ ops. 
    (isWr i ∧ wrType i = Local ∧ attribute (var i) = UC ∧ isRd j ∧ 
     isWr k ∧ wrType k = Remote ∧ wrProc k = p k ∧ wrID k = wrID i ∧ 
     order i j ∧ order j k) ⇒ 
    (wrProc k ≠ p j ∨ var i ≠ var j) 

Abstract. Several optimal algorithms have been proposed for the complementation 
of nondeterministic Büchi word automata. Due to the intricacy of the problem 
and the exponential blow-up that complementation involves, these algorithms have 
never been used in practice, even though an effective complementation construc- 
tion would be of significant practical value. Recently, Kupferman and Vardi de- 
scribed a complementation algorithm that goes through weak alternating automata 
and that seems simpler than previous algorithms. We combine their algorithm with 
known and new minimization techniques. Our approach is based on optimizations 
of both the intermediate weak alternating automaton and the final nondeterminis- 
tic automaton, and involves techniques of rank and height reductions, as well as 
direct and fair simulation. 



1 Introduction 

Efforts to develop simple complementation algorithms for nondeterministic Büchi 
automata started early in the 60s, motivated by decision problems of second-order 
logics. In [5], Büchi suggested a complementation construction that involved a 
complicated combinatorial argument and a doubly-exponential blow-up in the state 
space. Thus, complementing an automaton with n states resulted in an automaton with 
2^{2^{O(n)}} states. In [22], Sistla, Vardi, and Wolper suggested an improved 
construction, with 2^{O(n²)} states. Only in [20], however, Safra introduced an 
optimal determinization construction, which also enabled a 2^{O(n log n)} 
complementation construction, matching the known lower bound [18]. Another 
2^{O(n log n)} construction was suggested by Klarlund in [10], which circumvented 
the need for determinization. While being the heart of many complexity results in 
verification, the constructions in [20,10] are complicated and difficult to 
program. We know of no implementation of Klarlund’s algorithm, and the implementation 
of Safra’s algorithm [24] has to cope with the involved structure of the states in the 
complementary automaton. 

The lack of a simple implementation is not due to a lack of need. In the 
automata-theoretic approach to verification, we check correctness of a system with 
respect to a specification by checking containment of the language of the system in 
the language of an automaton that accepts exactly all computations that satisfy the 
specification. In order to check the latter, we check that the intersection of the 
system with an automaton that accepts exactly all the computations that violate the 
specification is empty. For instance, LTL model checking [15,25] usually proceeds by 
translating the negation of an LTL formula into a Büchi automaton. When properties 
are specified by ω-regular automata, one needs to complement the property automaton. 
Due to the lack of a simple complementation construction, the user is typically 
required to specify the property by deterministic Büchi automata [14] (it is easy to 
complement a deterministic automaton), or to supply the automaton for the negation of 
the property [9]. Similarly, specification formalisms like ETL [26], which have 
automata within the logic, involve complementation of automata, and the difficulty of 
complementing Büchi automata is an obstacle to practical use [3]. In fact, even when 
the properties are specified in LTL, complementation is useful: the translators from 
LTL into automata have reached a remarkable level of sophistication (cf. [23,8]). 
Even though complementation of the automata is not explicitly required, the 
translations are so involved that it is useful to check their correctness, which 
involves complementation.¹ Complementation is interesting in practice also because 
it enables refinement and optimization techniques that are based on language 
containment rather than simulation. Thus, an effective algorithm for the 
complementation of Büchi automata would be of significant practical value. 

In [12], Kupferman and Vardi describe a complementation procedure that is simpler 
than those in [20,10]. The key idea of [12] is to go via weak alternating automata. In an 
alternating automaton [6], both existential and universal branching modes are allowed, 
and the transitions are given as Boolean formulas over the set of states. For example, 
a transition δ(q, σ) = q₁ ∨ (q₂ ∧ q₃) means that when the automaton is in state q and 
it reads a letter σ, it should accept the suffix of the word either from state q₁ or from 
both states q₂ and q₃. Let α be the set of the automaton’s accepting states. In a weak 
automaton, each strongly connected component of the graph induced by the transition 
function is either accepting (trivial, or contained in α) or rejecting (its intersection with 
α is empty). Since the strongly connected components are partially ordered, each path in 
the run eventually gets trapped in one of them. The run is accepting if all paths get trapped 
in accepting components. The height of a weak automaton is the maximal number of 
alternations between accepting and rejecting components in a path in the graph of the 
automaton, plus one. 

The rich structure of alternating automata makes their complementation trivial: 
one only has to dualize the transition function and the acceptance condition. Removing 
alternation from Büchi automata involves a simple extension of the subset construction 
[19]. Unfortunately, by dualizing the given nondeterministic Büchi automaton, one gets 
a universal co-Büchi automaton, creating a gap in the construction. This gap is closed 
in [12], whose complementation construction consists of the following steps. 

(1) Dualize the given nondeterministic Büchi automaton B, and obtain a universal 
    co-Büchi automaton C for the complement language. This step is trivial and involves 
    no blow up. 

¹ For an LTL formula ψ, one typically checks that both the intersection of A_ψ with A_¬ψ 
and the intersection of their complementary automata are empty. 

(2) Translate C to an alternating weak automaton W accepting the same language. If C 
    has n states, then W has O(n²) states. 

(3) Translate W to a nondeterministic Büchi automaton N. This step follows the 
    exponential subset construction of [19]. The state space of N can be restricted to 
    consistent subsets², making the overall blow up 2^{O(n log n)} rather than 2^{O(n²)}. 

In this paper we study and describe an arsenal of optimization techniques that can be 
applied in the last two steps. The techniques can be partitioned into the following classes. 

Rank Reduction. The translation in Step (2) is based on an analysis of the accepting 
runs of C. Each vertex of the run is associated with a rank in the range {0, ..., 2n}. Like 
the progress measure of [10], the rank of a vertex indicates how easy it is to prove that 
all the paths that start at the vertex visit α only finitely often. The rank of a universal 
co-Büchi automaton C is the maximal rank of a vertex in an accepting run of C. 

If the state space of C is Q and its rank is k, the state space of W can be restricted to 
Q × {0, ..., k}. Hence, finding and/or reducing the rank of C is desirable. We study ranks 
of languages, namely the minimal rank of a universal co-Büchi automaton that recognizes 
their complement. We show that, surprisingly, the rank of all ω-regular languages is 3 (a 
nice corollary, also proved in [16], is that all ω-regular languages can be recognized by 
an alternating weak automaton of height 3). Reducing the rank to 3, however, has a flavor 
of determinization, and involves an exponential blow-up in the state space. Accordingly, 
we prefer the approach of finding the rank k of C. We show that the rank of C is bounded 
by 2(n − |α|), and that there are automata for which this bound is tight. As suggested in 
[12], the rank is often smaller. We find the rank by checking for language equivalence 
between W and its restrictions to Q × {0, ..., j}, for j < 2(n − |α|). 

Minimization of W. Once we have found the rank k of C and restricted the state space 
of W accordingly, we minimize W further. The transition function of W as described in 
[12] is of size |δ|k², where δ is the transition function of C. It is suggested in [12] to 
simplify it and obtain a function of size 3|δ|k. We simplify it further to 2|δ|k. The 
simplification is based on the simulation minimization we apply to W, which often 
reduces the state space and the transitions even more. Our simulation relation is similar 
to the alternating simulation of [2], extended to automata with acceptance conditions on 
the states (direct simulation) as well as an extension of it in which acceptance 
conditions are moved to the arcs. Finally, we reduce the height of W by repeatedly 
removing its minimal strongly connected component, as long as such a removal does not 
change its language. 

Minimization of N. Once N is produced by the subset construction, we apply further 
simplification techniques to it. The first is the fair simulation minimization of [8], and 
the second is similar to the height reduction described for W, performed on the strongly 
connected components of N. We note that the same reductions are applied also to the 
nondeterministic Büchi automaton B with which we start. 

As shown in [18], complementation of a nondeterministic Büchi automaton with 
n states may involve a 2^{O(n log n)} blow up. Accordingly, we measure the efficiency of 
our optimizations by the following two criteria: (1) we would like the result of 
complementing a nondeterministic Büchi automaton derived from an LTL formula to be 
comparable with what we get by negating the formula and then translating to a 
nondeterministic Büchi automaton; (2) we would like the result of complementing a 
nondeterministic Büchi automaton twice to be comparable with the original automaton. We 
demonstrate the effectiveness of our construction by examining several examples for 
which our construction produces the minimal nondeterministic Büchi automaton. We 
have implemented our procedure as an extension of the Wring translator from LTL to 
Büchi automata [23,8], and our experimental results are reported in Section 7. 

² We describe the consistency condition in Section 3. 



2 Preliminaries 

Let B⁺(Q) denote the set of positive Boolean formulas over Q. An alternating automaton 
on infinite words A = (Σ, Q, q_in, δ, α) consists of a finite alphabet Σ, a finite set of 
states Q, an initial state q_in ∈ Q, a transition function δ : Q × Σ → B⁺(Q), and an 
acceptance condition α ⊆ Q. For A = (Σ, Q, q_in, δ, α) and q ∈ Q, let 
A^q = (Σ, Q, q, δ, α). That is, A^q is obtained from A by making q the initial state. 

A run of an alternating automaton A on a word σ ∈ Σ^ω is a Q-labeled tree (T_r, r), 
where T_r is a prefix-closed subset of ℕ*, and r : T_r → Q is a labeling function. A run 
of A on σ = σ₀, σ₁, ... satisfies the following conditions: (1) r(ε) = q_in. (2) For a tree 
node t ∈ T_r such that r(t) = q and δ(q, σ_|t|) = β, there is a subset Q_t ⊆ Q that 
satisfies β, and such that the successors of t are labeled by the elements of Q_t. 

A run is accepting if all its infinite paths satisfy the acceptance condition. In a Büchi 
automaton, a path satisfies α if it intersects α infinitely often. In a co-Büchi automaton, 
a path satisfies α if it intersects it finitely many times. A word w ∈ Σ^ω is accepted by 
A if A has an accepting run on w. The words accepted by A form the language of A, 
denoted by L(A). 

Complementation of an alternating automaton is accomplished by dualizing its 
transition function, and changing the acceptance condition from Büchi to co-Büchi or 
vice versa. Dualization consists of exchanging ∧ with ∨, and true with false in δ. 
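
Dualization can be sketched in Python as follows. The representation is an assumption made for this example only: a positive Boolean formula is True, False, a state name (a string), or a tuple ("and", f, g, ...) or ("or", f, g, ...).

```python
def dualize(f):
    """Exchange 'and' with 'or' and True with False; atoms stay unchanged."""
    if isinstance(f, bool):
        return not f
    if isinstance(f, str):           # an atom, i.e. a state name
        return f
    op, *args = f
    dual_op = "or" if op == "and" else "and"
    return (dual_op, *[dualize(a) for a in args])


# The transition q1 or (q2 and q3) becomes q1 and (q2 or q3).
print(dualize(("or", "q1", ("and", "q2", "q3"))))
```

Applying dualize to every transition formula and switching the acceptance condition between Büchi and co-Büchi yields the complement automaton; applying it twice returns the original formula.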

A positive Boolean formula has a unique minimal DNF. Therefore δ(q, σ_i) ∈ B⁺(Q) 
identifies a set of sets of states Δ(q, σ_i) ⊆ 2^Q. For instance, if 
δ(q₀, σ₀) = (q₁ ∧ (q₂ ∨ q₃)) ∨ (q₁ ∧ q₂ ∧ q₄), then Δ(q₀, σ₀) = {{q₁, q₂}, {q₁, q₃}}. 
The Boolean formulas true and false translate into {∅} and ∅, respectively. The choice 
of Q_t ⊆ Q required by the definition of run can always be restricted so that 
Q_t ∈ Δ(q, σ_|t|). 
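
The passage from a positive Boolean formula to its set of minimal disjuncts can be illustrated with a short Python sketch. The tuple encoding of formulas (binary "and" and "or" nodes over state names) and the function names are assumptions of the example.

```python
def disjuncts(f):
    """Return a set of frozensets, one per disjunct of some DNF of f."""
    if f is True:
        return {frozenset()}
    if f is False:
        return set()
    if isinstance(f, str):                       # an atom (state name)
        return {frozenset([f])}
    op, left, right = f
    l, r = disjuncts(left), disjuncts(right)
    if op == "or":
        return l | r
    return {a | b for a in l for b in r}         # "and": distribute over "or"


def minimal_dnf(f):
    """Keep only the disjuncts that contain no other disjunct."""
    ds = disjuncts(f)
    return {d for d in ds if not any(other < d for other in ds)}


# The example from the text: (q1 and (q2 or q3)) or (q1 and q2 and q4).
f = ("or", ("and", "q1", ("or", "q2", "q3")),
           ("and", "q1", ("and", "q2", "q4")))
print(minimal_dnf(f))
```

The disjunct {q1, q2, q4} is dropped because it strictly contains {q1, q2}, which leaves the two sets given in the text.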

Nondeterministic automata are alternating automata in which each C ∈ Δ(q, l) is a 
singleton for every q ∈ Q and l ∈ Σ. Universal automata are alternating automata in 
which Δ(q, l) is a singleton for every q ∈ Q and l ∈ Σ. Deterministic automata are at 
the same time nondeterministic and universal. 

A maximal strongly connected component (MSCC) of a directed graph is a maximal 
subgraph such that each node in the subgraph has a path to every node in the subgraph. A 
trivial MSCC contains one node and no arcs. We assume that all the trivial MSCCs of an 
automaton are contained in α. A weak alternating automaton is such that each MSCC of 
its transition graph is either disjoint from α or contained in it. From a weak alternating 
automaton with co-Büchi acceptance A one obtains a weak alternating automaton with 
Büchi acceptance A' such that L(A') = L(A) simply by taking α' = Q \ α. 

We use three-letter abbreviations to designate types of automata: The first letter 
characterizes the transition structure and is one of “D” (deterministic), “N” 
(nondeterministic), “U” (universal), and “A” (alternating). The second letter identifies 
the acceptance condition and is one of “B” (Büchi), “C” (co-Büchi), and “W” (weak). 
Finally, the third letter designates the objects accepted by the automata; in this paper 
we are only concerned with “W” (infinite words). Hence, NBW designates a nondeterministic 
Büchi automaton, UCW designates a universal co-Büchi automaton, and AWW designates a 
weak alternating automaton, all on infinite words. 

3 Ranks and Complementation 

In this section we review the relevant technical details of [12]. Consider a UCW 
A = (Σ, Q, q_in, δ, α) obtained by dualizing NBW B, and a word w. Let |Q| = n. The run of 
A on w can be represented by a directed acyclic graph (dag) G_r = (V, E), where 

- V ⊆ Q × ℕ is such that (q, l) ∈ V iff there exists x ∈ T_r with |x| = l and r(x) = q. 
  For example, (q_in, 0) is the only vertex of G_r in Q × {0}. 

- E ⊆ ⋃_{l≥0} (Q × {l}) × (Q × {l + 1}) is such that E((q, l), (q', l + 1)) iff there 
  exists x ∈ T_r with |x| = l, r(x) = q, and r(x · c) = q' for some c ∈ ℕ. 

We say that a vertex (q', l') is a successor of a vertex (q, l) iff E((q, l), (q', l')). We say 
that (q', l') is reachable from (q, l) iff there exists a sequence (q₀, l₀), ..., (q_i, l_i) of 
successive vertices such that (q, l) = (q₀, l₀), and (q', l') = (q_i, l_i). Finally, we say that 
a vertex (q, l) is an α-vertex iff q ∈ α. It is easy to see that (T_r, r) is accepting iff all 
paths in G_r have only finitely many α-vertices. 

Consider a (possibly finite) dag G ⊆ G_r. We say that a vertex (q, l) is finite in G 
iff only finitely many vertices in G are reachable from (q, l). We say that a vertex (q, l) 
is α-free in G iff all the vertices in G that are reachable from (q, l) are not α-vertices. 
Finally, we say that the width of G is k if k is the maximal number for which there are 
infinitely many levels l such that there are k vertices of the form (q, l) in G. Note that the 
width of G_r is at most n. Given an accepting run dag G_r, we define an infinite sequence 
G₀ ⊇ G₁ ⊇ G₂ ⊇ ... of dags inductively as follows. 

- G₀ = G_r. 

- G_{2i+1} = G_{2i} \ {(q, l) | (q, l) is finite in G_{2i}}. 

- G_{2i+2} = G_{2i+1} \ {(q, l) | (q, l) is α-free in G_{2i+1}}. 

It is shown in [12] that for every i ≥ 0, the transition from G_{2i+1} to G_{2i+2} involves 
the removal of an infinite path from G_{2i+1}. Since the width of G₀ is bounded by n, it 
follows that the width of G_{2i} is at most n − i. Hence, G_{2n} is finite, and G_{2n+1} is empty. 

Each vertex (q, l) in G_r has a unique index i ≥ 0 such that (q, l) is either finite 
in G_{2i} or α-free in G_{2i+1}. Given a vertex (q, l), we define the rank of (q, l), denoted 
rank(q, l), as follows. 

    rank(q, l) = 2i        if (q, l) is finite in G_{2i}, 
                 2i + 1    if (q, l) is α-free in G_{2i+1}. 







For k ∈ ℕ, let [k] denote the set {0, 1, ..., k}, and let [k]^odd denote the set of odd 
members of [k]. By the above, the rank of every vertex in G_r is in [2n]. Recall that when 
the run is accepting, all the paths in G_r visit only finitely many α-vertices. Intuitively, 
rank(q, l) hints at how difficult it is to prove that all the paths of G_r that visit the vertex 
(q, l) visit only finitely many α-vertices. Easiest to prove are vertices that are finite in 
G₀. Accordingly, they get the minimal rank 0. Then come vertices that are α-free in the 
graph G₁. These vertices get the rank 1. The process repeats until all vertices get some 
rank. 

We say that an integer j is a required rank for a UCW A if there exists a word 
w ∈ L(A) such that some vertex in the run of A on w gets rank j. Then, the rank of A 
is the maximal rank required for A. The annotation of runs with ranks is used in order 
to translate UCW into AWW: 

Theorem 1. Let A be a UCW with n states and rank k. There is an AWW A' with 
n(k + 1) states such that L(A') = L(A). 

Proof. Let A = (Σ, Q, q_in, δ, α). We define A' = (Σ, Q', q'_in, δ', α'), where 

- Q' = Q × [k]. Intuitively, A' is in state (q, i) if it guesses that in the accepting 
  run of A on w, the rank of (q, l) is i. An exception is the initial state explained 
  below. 

- q'_in = (q_in, k). That is, q_in is paired with k, which is an upper bound on the rank of 
  (q_in, 0). 

- We define δ' by means of a function release : B⁺(Q) × [k] → B⁺(Q'). Given a 
  formula θ ∈ B⁺(Q), and a rank i ∈ [k], the formula release(θ, i) is obtained from 
  θ by replacing an atom q by the disjunction ⋁_{i' ≤ i} (q, i'). For example, 
  release(q₃ ∧ q₅, 2) = ((q₃, 2) ∨ (q₃, 1) ∨ (q₃, 0)) ∧ ((q₅, 2) ∨ (q₅, 1) ∨ (q₅, 0)). 

  Now, δ' : Q' × Σ → B⁺(Q') is defined, for a state (q, i) ∈ Q' and σ ∈ Σ, as 
  follows. 

      δ'((q, i), σ) = release(δ(q, σ), i)    if q ∉ α or i is even, 
                      false                  if q ∈ α and i is odd. 

  That is, if the current guessed rank is i then, by employing release, the run can 
  move to its successors at any rank that is not larger than i. If, however, q ∈ α and 
  the current guessed rank is odd, then, by the definition of ranks, the current guessed 
  rank is wrong, and the run is rejecting. 

- α' = Q × [k]^odd. That is, infinitely many guessed ranks along a path should be odd. 

To see that the automaton A' is weak, note that each set Q × {i} is a collection of strongly 
connected components that agree on their classification as accepting or rejecting. Indeed, 
membership in α' depends on the parity of i, and the transitions in δ' are such that from 
a state in Q × {i} the automaton A' proceeds only to states in Q × {j}, for j ≤ i. □ 
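
The function release and the transition function of A' can be sketched in Python. This is an illustration under assumed encodings, not code from the paper's implementation: formulas are True, False, a state name, or tuples ("and", ...) and ("or", ...); the namedtuple State and the dictionary delta, keyed by (state, letter), are hypothetical names introduced here.

```python
from collections import namedtuple

State = namedtuple("State", "q rank")    # a state (q, i) of the weak automaton


def release(theta, i):
    """Replace each atom q by (q, i) or (q, i-1) or ... or (q, 0)."""
    if isinstance(theta, bool):
        return theta
    if isinstance(theta, str):
        return ("or", *[State(theta, r) for r in range(i, -1, -1)])
    op, *args = theta
    return (op, *[release(a, i) for a in args])


def delta_prime(delta, alpha, q, i, sigma):
    """Transition of A' from state (q, i) on letter sigma, as in the proof."""
    if q in alpha and i % 2 == 1:
        return False                      # an odd rank was guessed for a state in alpha
    return release(delta[(q, sigma)], i)


# The example from the proof: release(q3 and q5, 2).
print(release(("and", "q3", "q5"), 2))
```

Because release only introduces ranks from i downward, every transition of the sketch leads from Q × {i} to states with rank at most i, which is the property used above to argue that A' is weak.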



Once we know how to translate UCW to AWW, complementation is reduced to 
removal of alternation from ABW (recall that AWW are a special case of ABW). In [19], 
Miyano and Hayashi describe such a translation. We present (a simplified version of) 
their translation in Theorem 2 below. 
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Theorem 2. [19] Let A be an alternating Biichi automaton. There is a nondeterministic 
Biichi automaton A! , with exponentially many states, such that L{A') = L{A). 

Proof. The automaton A' guesses a run of A. At a given point of a run of A', it keeps
in its memory a whole level of the run tree of A. As it reads the next input letter, it
guesses the next level of the run tree of A. In order to make sure that every infinite path
visits states in α infinitely often, A' keeps track of states that “owe” a visit to α. Let
A = (Σ, Q, q_in, δ, α). Then A' = (Σ, 2^Q × 2^Q, ({q_in}, ∅), δ', 2^Q × {∅}), where δ' is
defined, for all (S, O) ∈ 2^Q × 2^Q and σ ∈ Σ, as follows.

- If O ≠ ∅, then δ'((S, O), σ) = {(S', O' \ α) | S' satisfies ∧_{q∈S} δ(q, σ), O' ⊆ S',
and O' satisfies ∧_{q∈O} δ(q, σ)}.

- If O = ∅, then δ'((S, O), σ) = {(S', S' \ α) | S' satisfies ∧_{q∈S} δ(q, σ)}.

□
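One step of this subset construction might be sketched as follows. This is a hedged illustration, not the authors' implementation: it assumes formulas are stored in disjunctive normal form as sets of frozensets of states, and it only generates the successor sets obtained by choosing one disjunct per state (reusing the same choice for the states in O), which is a common simplification of "S' satisfies ...". The name `mh_successors` is hypothetical.

```python
from itertools import product


def mh_successors(S, O, letter, delta, alpha):
    """Successors of the pair (S, O) on `letter` in the style of Theorem 2.

    delta maps (state, letter) to a DNF: a set of frozensets of states.
    Assumption: only models built from one chosen disjunct per state
    are enumerated, not all supersets of them.
    """
    order = sorted(S)
    succs = set()
    # One disjunct of delta(q, letter) is chosen for every q in S.
    for choice in product(*(delta[(q, letter)] for q in order)):
        S2 = frozenset().union(*choice)
        if O:
            # States that still owe a visit to alpha follow their own choices.
            O2 = frozenset().union(*(c for q, c in zip(order, choice) if q in O))
            succs.add((S2, O2 - alpha))
        else:
            # All obligations were met: restart tracking from the new level.
            succs.add((S2, S2 - alpha))
    return succs
```

On the two-state AWW used later in Section 5.2 (δ(a, 0) = a ∧ b, δ(b, 0) = b, α = {a}), the only reachable pairs have a nonempty second component, which matches the claim there that its language is empty.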

For an NBW B, the rank of B is the rank of its dual UCW. When complementing an NBW
B with n states and rank k, its dual UCW has n states and rank k as well, the AWW
W constructed in Theorem 1 has O(nk) states, and the final NBW N constructed in
Theorem 2 has 2^{O(nk)} states. By [18,20], however, an optimal complementation
construction for nondeterministic Büchi automata results in an automaton with 2^{O(n log n)}
states, which may be smaller. Let B = (Σ, Q, q_in, δ, α). Consider a state (S, O) of N.
Each of the sets S and O is a subset of Q × [k]. We say that P ⊆ Q × [k] is consistent
iff for every two states (q, i) and (q', i') in P, if q = q' then i = i'. It is shown in [12]
that restricting the state space of N to pairs (S, O) for which S is a consistent subset
of Q × [k] is allowable; that is, the resulting N still complements B. Since there are
2^{O(n log k)} consistent subsets of Q × [k], we have the following.

Theorem 3. Let A be an NBW with n states and rank k. There is an NBW A' with
2^{O(n log k)} states such that L(A') = comp(L(A)).
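The counting behind the 2^{O(n log k)} bound can be spelled out as follows. This is a short worked derivation under the assumption that [k] denotes {0, ..., k}: a consistent subset assigns to each of the n states either no rank or exactly one of the k + 1 ranks.

```latex
\[
  \bigl|\{\, P \subseteq Q \times [k] : P \text{ is consistent} \,\}\bigr|
  \;=\; (k+2)^{n}
  \;=\; 2^{\,n \log_2 (k+2)}
  \;=\; 2^{O(n \log k)} \qquad (k \ge 2).
\]
% In the construction of Theorem 2 the second component satisfies O \subseteq S,
% so each state is absent, in S only (k+1 ranks), or in both S and O (k+1 ranks):
\[
  \bigl|\{\, (S, O) : S \text{ consistent},\; O \subseteq S \,\}\bigr|
  \;=\; (2k+3)^{n} \;=\; 2^{O(n \log k)} .
\]
```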

4 Ranks of Automata and Languages 

Consider a UCW A with n states and a word w ∈ Σ^ω. Let G_0, G_1, ..., G_{2n+1} be the
sequence of dags constructed in [12] for the run of A on w. Recall that the transition
from G_{2i+1} to G_{2i+2} involves a removal of an infinite path from G_{2i+1}, which is why
the width of G_{2i} is at most n - i. As noted to us by Doron Bustan, all the vertices in
the removed path are not α-vertices. Hence, one could argue that the n - i bound on the
width of G_{2i} holds also for a tighter definition of width: let the α-less width of G_i be
the maximal number k for which there are infinitely many levels l such that there are k
vertices not in α of the form ⟨q, l⟩. With this tighter definition, the α-less width of G_0 is
bounded by n - |α|, implying that the α-less width of G_{2i} is at most n - (|α| + i). In
particular, the α-less width of G_{2(n-|α|)} is at most 0. Hence G_{2(n-|α|)} has only finitely
many vertices that are not α-vertices. Since G_0 is accepting, then, by König's Lemma,
G_{2(n-|α|)} also has only finitely many α-vertices. It follows that G_{2(n-|α|)} is finite,
implying that all vertices get ranks in 0, ..., 2(n - |α|).

In practice, the transition from G_{2i} to G_{2i+2} often reduces the width by more than
one. One may wonder whether it is possible to tighten the analysis above even more in
order to show that a rank of 2(n - |α|) is never required. Recall that an integer j is a
required rank for A if there exists a word w ∈ L(A) such that some vertex in the run of
A on w gets rank j. Equivalently, the α-less width of G_j (with G_0 being the run dag of
A on w) is strictly larger than 0. As follows from Theorems 1 and 3, the rank of A plays
an important role in the sizes of equivalent AWW and NBW for it. It is shown in [12]
that the problem of finding the rank of a UCW A is PSPACE-complete. By the above,
the rank of A is at most 2(n - |α|). By the following theorem, there are cases in which
this bound is tight.

Theorem 4. There is a family A_1, A_2, ... of UCW such that A_n has n + 1 states,
acceptance set of size 1, and rank 2n.

We now turn to study ranks of ω-regular languages. For an ω-regular language L,
we say that the rank of L is k iff there is a UCW of rank at most k for comp(L). It is
tempting to think that ranks induce an infinite hierarchy R_0 ⊆ R_1 ⊆ ··· of languages,
with R_i containing all languages of rank i. We show that the hierarchy collapses at R_3
(that is, all ω-regular languages have rank at most 3) and characterize its four levels. For
a definition of safety and co-safety languages, see [1,21].

Theorem 5. R_3 = ω-regular languages, R_2 = DBW, R_1 = co-safety languages, and
R_0 = safety languages.

The hierarchy induced by ranks is closely related to a hierarchy induced by heights 
of AWW. Intuitively, the height of an AWW is the number of accepting and rejecting 
layers it has. Formally, the height of an AWW A is the number of alternations between 
accepting and rejecting components in the graph of A, plus one, where the constants 
true and false are counted as accepting and rejecting components, respectively. For an 
integer k, let AWW[k] denote the set of AWW of height at most k, or the ω-regular
languages accepted by such automata. Theorem 5 implies Theorem 6 below, which was
proved first in [16]. Note that Theorem 5 is stronger than Theorem 6 and does not follow
from it.

Theorem 6. AWW[3] = ω-regular languages, AWW[2] = DBW ∪ DCW, and AWW[1]
= safety or co-safety languages.

The results in this section imply that procedures for rank reduction that modify the 
given UCW are much stronger than those that calculate its rank. On the other hand, the 
reduction of the rank to 3 involves determinization, which we are trying to avoid, and 
which may cause an exponential blow-up. In view of this trade-off between the size of 
UCW and their ranks, our efforts focus on calculating the rank of the given UCW, rather 
than on modifying it. 

5 Simplifying Alternating Büchi Automata

The construction of Theorem 2 may cause an exponential blow-up. Hence, before ap- 
plying it, we try to simplify the AWW W in three ways: by simulation minimization, by 
computing the rank of the UCW C, and by removing redundant MSCCs. 

5.1 Simulation Minimization

We recall that for an ABW, Δ(q, l) is a set of sets. Each member of Δ(q, l) is a conjunction
of states. We define simulation between alternating automata in terms of a game as in
[2]. Let A_A = (Σ, Q_A, q_iA, δ_A, α_A) and A_P = (Σ, Q_P, q_iP, δ_P, α_P) be two ABWs;
automaton A_P simulates automaton A_A if, given players P and A, P has a winning
strategy for the following game. The positions of the game are the elements of Q_A × Q_P;
the initial position is (q_iA, q_iP), and the possible successors of a position (s_A, s_P) are
all pairs (t_A, t_P) obtained by application of the following rule:

- A chooses a letter l ∈ Σ and a set of states C_A ∈ Δ_A(s_A, l);

- P chooses a set of states C_P ∈ Δ_P(s_P, l);

- A chooses t_P ∈ C_P;

- P chooses t_A ∈ C_A.

A player who has to choose from an empty set loses. If this never happens, the play
is infinite. The winner of an infinite play depends on whether one considers direct
simulation or fair simulation. For direct simulation, A wins iff for some position (s_A, s_P)
encountered, s_A ∈ α_A and s_P ∉ α_P. For fair simulation, A wins iff there are infinitely
many positions such that s_A ∈ α_A, but only finitely many positions such that s_P ∈ α_P.
P wins if A does not. As in the case of NBWs, direct simulation implies fair simulation,
and fair simulation implies language containment; the converse is not true [2].

Theorem 7. Let A = (Σ, Q, q_in, δ, α) and A' = (Σ, Q', q'_in, δ', α') be two ABWs.
If q_in direct simulates q'_in, then q_in fair simulates q'_in. If q_in fair simulates q'_in, then
L(A) ⊇ L(A').

If two states q_1 and q_2 are such that each simulates the other, we say that q_1 and q_2
are simulation equivalent. Two ABWs are simulation equivalent if their initial states are.
Of particular interest to us is the case in which the two automata are A^{q_1} and A^{q_2} for
q_1, q_2 ∈ Q; that is, we are interested in the simulation relation on the states of ABW A.
The “layered” structure of the AWW W implies the existence of a nontrivial simulation
relation.

Theorem 8. Let A = (Σ, Q, q_in, δ, α) be a UCW with rank k; let A' be the equivalent
AWW of Theorem 1. Then, for every (q, j) ∈ Q × {0, ..., k} and i ∈ {0, ..., j}, if j is
even or q ∉ α, then (q, j) fair simulates (q, i) in A'. If in addition j is odd or i is even,
then (q, j) direct simulates (q, i).

The simulation of Theorem 8 allows us to improve on [12, Remark 4.2] and reduce
the size of the transition relation of W from 3|δ|k to 2|δ|k, where δ is the transition
function and k is the rank of the UCW C.

Theorem 9. If in Theorem 1, release(θ, i) is redefined so that an atom q is replaced by
(q, i) ∨ (q, i - 1) if i > 0, and by (q, 0) for i = 0, then L(A') = L(A).

In general, simulations between states of an ABW can be used to merge states (in
case of simulation equivalence), remove transitions, or simplify transitions.^1 The last
use is specific to alternating automata: Suppose C ∈ Δ(q_i, σ_j) contains two states in
direct simulation relation. Then, the simulating one can be removed because acceptance
from the simulated state guarantees acceptance from the simulating one.

^1 This is in contrast to [7], which only considers simulation equivalence quotients. Besides, its
model of alternating automata with existential and universal states makes even direct simulation
unsafe for minimization.

Theorem 10. Let A = (Σ, Q, q_in, δ, α) be an ABW. Let q_1 and q_2 be two states in Q
such that q_2 direct simulates q_1. Suppose {q_1, q_2} ⊆ C ∈ Δ(q, l), for some q ∈ Q
and l ∈ Σ. Then the automaton A' obtained from A by replacing C in Δ(q, l) with
C' = C \ {q_2} is direct simulation equivalent to A.
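A single application of Theorem 10 to one conjunct can be sketched as below. The encoding is an assumption made for illustration: the direct simulation relation is a set `rel` of pairs (u, v) meaning "v direct simulates u", a conjunct C is a set of states, and the function name `apply_theorem_10` is hypothetical. Only one state is removed per call, since the theorem is stated for a single pair.

```python
def apply_theorem_10(C, rel):
    """Drop one simulating state q2 from C when some other q1 in C has (q1, q2) in rel.

    (q1, q2) in rel is read as "q2 direct simulates q1" (assumed convention).
    Returns the (possibly unchanged) conjunct as a frozenset.
    """
    for q1 in sorted(C):
        for q2 in sorted(C):
            if q1 != q2 and (q1, q2) in rel:
                return frozenset(C) - {q2}
    return frozenset(C)
```

For example, with states a and b where a direct simulates b, the conjunct a ∧ b is reduced to b.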

Theorem 11. Let A = (Σ, Q, q_in, δ, α) be an ABW. Let C_1, C_2 ∈ Δ(q, l), for some
q ∈ Q, l ∈ Σ. Suppose that C_1 ≠ C_2, and that ∀q_1 ∈ C_1, ∃q_2 ∈ C_2 such that q_1
direct simulates q_2. Then the automaton A' obtained from A by replacing Δ(q, l) with
Δ'(q, l) = Δ(q, l) \ {C_2} is direct simulation equivalent to A.

Two simulation equivalent states q_1 and q_2 are merged by the following steps: (1) for
every letter l, δ(q_1, l) is replaced by δ(q_1, l) ∨ δ(q_2, l); (2) q_2 is replaced by q_1 throughout;
(3) q_1 is made initial if q_2 is; (4) q_2 is dropped.

Corollary 1. Let A = (Σ, Q, q_in, δ, α) be an ABW. If two states q_1, q_2 ∈ Q are direct
simulation equivalent, the automaton obtained by merging q_1 and q_2 is simulation
equivalent to A.
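The four merging steps can be written down directly. The following is a minimal sketch under assumed conventions (transition formulas in disjunctive normal form as sets of frozensets of states, a single initial state); the function name `merge_states` and its signature are hypothetical.

```python
def merge_states(q1, q2, states, alphabet, delta, q_init, alpha):
    """Merge q2 into q1 following steps (1)-(4) of the text."""

    def rename(conj):
        # Step (2): every occurrence of q2 becomes q1.
        return frozenset(q1 if q == q2 else q for q in conj)

    new_delta = {}
    for q in states:
        if q == q2:
            continue  # Step (4): q2 is dropped.
        for l in alphabet:
            dnf = delta[(q, l)]
            if q == q1:
                # Step (1): disjunction is set union in DNF.
                dnf = dnf | delta[(q2, l)]
            new_delta[(q, l)] = {rename(c) for c in dnf}
    # Step (3): q1 becomes initial if q2 was.
    new_init = q1 if q_init == q2 else q_init
    return set(states) - {q2}, new_delta, new_init, set(alpha) - {q2}
```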

The computation of the direct simulation relation is based on the following observation
[2]. Let S be a simulation relation on the states of an ABW over alphabet Σ. Then
(u, v) ∈ S implies

    ∀l ∈ Σ. ∀C ∈ Δ(u, l). ∃C' ∈ Δ(v, l). ∀v' ∈ C'. ∃u' ∈ C. (u', v') ∈ S.

We can therefore compute the direct simulation relation as a greatest fixpoint by starting
with all the pairs of states (u, v) such that acceptance of u implies acceptance of v, and
removing pairs that violate the condition above.
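The greatest-fixpoint computation just described can be sketched naively as follows. This is an illustration under assumptions, not the implementation evaluated in the paper: Δ is given as a dictionary from (state, letter) to a set of frozensets of states, the relation is a set of pairs (u, v) read as "v direct simulates u", and no attempt is made at efficiency. The name `direct_simulation` is hypothetical.

```python
def direct_simulation(states, alphabet, delta, alpha):
    """Largest relation satisfying the local condition quoted from [2]."""
    # Start from all pairs where acceptance of u implies acceptance of v.
    rel = {(u, v) for u in states for v in states if u not in alpha or v in alpha}

    def holds(u, v):
        # For every letter l and C in delta(u, l) there is C2 in delta(v, l) such that
        # every v2 in C2 is matched by some u2 in C with (u2, v2) in rel.
        return all(
            any(
                all(any((u2, v2) in rel for u2 in C) for v2 in C2)
                for C2 in delta[(v, l)]
            )
            for l in alphabet
            for C in delta[(u, l)]
        )

    changed = True
    while changed:
        changed = False
        for pair in list(rel):
            if not holds(*pair):
                rel.discard(pair)
                changed = True
    return rel
```

On the two-state automaton with δ(a, 0) = a ∧ b, δ(b, 0) = b and α = {a}, the sketch reports that a direct simulates b but not the other way round.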

5.2 Simulation with Accepting Arcs 

The definition of direct simulation given in Section 5.1 assumes that u ∈ α implies v ∈ α
for every pair (u, v) in the relation.
However, we may compute a larger relation by considering the acceptance conditions to
be on the arcs. Let every set of states C ∈ Δ(q, l) be a transition out of q ∈ Q enabled
by l ∈ Σ. An arc of transition C is the pair (q, q'), for some state q' ∈ C. An arc (q, q')
is accepting if q' ∈ α. We can modify the definition of direct simulation as follows.
Player A wins an infinite play if for some position (s_A, s_P), the arc (s_A, t_A) of C_A is
accepting, but the arc (s_P, t_P) is not. Player P wins if A does not.

This approach may lead to simplifications not allowed by the original definition of 
direct simulation. However, Theorems 10 and 11 do not hold when acceptance conditions
are moved to the arcs. Consider an AWW with Σ = {0}, Q = {a, b}, q_in = a,
δ(a, 0) = a ∧ b, δ(b, 0) = b, and α = {a}. Here b direct simulates a when acceptance
is on the arcs. In this case the only accepting arc is the self-loop on a. However, δ(a, 0)
cannot be simplified to a lest the language changes from empty to Σ^ω. To obviate this
problem, while computing the direct simulation relation with accepting arcs, we mark 
all the arcs that are used to justify the relation itself. We then allow simplification of a 
transition according to Theorem 10 only if the arcs to be removed are not marked. 

5.3 Simplification Based on Language Containment

Theorem 8 gives conditions under which (q, j) simulates (q, i) for j > i. However,
no such general result can be proved for j < i. To determine the rank of the UCW C
obtained by dualization of the given NBW B, and hence the required height of the AWW
W, we resort to a language containment check. Specifically, since the rank is bounded by
2(n - |α|), we apply the construction of Theorem 1 with k = 2(n - |α|) to build an AWW
W' such that L(W') = L(C). The construction of Theorem 2 applied to W' yields N'.

To check whether k ∈ {0, 2, ..., 2(n - |α| - 1)} is the rank of C, we restrict W' to
Q × {0, ..., k}, make (q_in, k) initial, and call the result W''. We then obtain an AWW
D for comp(L(W'')) by dualization of W'', and apply Theorem 2 to it to produce N''.
Since we know that L(W'') ⊆ L(W'), if the intersection of N' and N'' is empty, then
k is an upper bound on the rank of C. If one tries the possible values of k in increasing
order, the first time the intersection is empty, k is the rank of C, and W = W''. It is
important to note that the restriction to consistent subsets is allowed when converting W
to NBW, but is not allowed when converting D. This makes the determination of the rank
a particularly expensive operation. To partially offset this cost, simulation minimization
is always applied to D before the subset construction.
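The search over candidate ranks described above amounts to a simple loop. In the sketch below, the emptiness check is abstracted into a caller-supplied function `intersection_is_empty(k)`, a hypothetical stand-in for building W'', dualizing it, applying Theorem 2, and intersecting the result with N'. Falling back to the bound 2(n - |α|) when no smaller even value succeeds is an assumption that follows from that bound being a valid rank.

```python
def find_rank(n, alpha_size, intersection_is_empty):
    """Try k = 0, 2, ..., 2(n - |alpha| - 1) in increasing order.

    intersection_is_empty(k) is an assumed oracle that returns True when the
    automaton restricted to ranks 0..k still accepts all of L(W').
    """
    bound = 2 * (n - alpha_size)
    for k in range(0, bound, 2):
        if intersection_is_empty(k):
            return k  # first success: k is taken as the rank
    return bound  # no smaller even value worked; keep the general bound
```

Because each oracle call hides a subset construction without the consistency restriction, stopping at the first success keeps the number of expensive checks as small as the increasing order allows.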

The language-containment approach can be used to further simplify W. Specifically, 
we try to remove an MSCC from W, and all the transitions with at least one destination 
state in the chosen MSCC. This guarantees that the language of the resulting automaton 
is contained in the language of the original one. A single language containment check 
then suffices to check whether the language remains the same. The MSCCs are examined 
in topological order from terminal to initial. If the language does not change, the removal 
of the MSCC is greedily accepted. We refer to this process as pruning the AWW. 



5.4 Simplification Procedure 

If the NBW B is weak, so is the UCW C. Hence, the construction of Theorem 1 is
not required, because a UWW is a special case of AWW. Since B has been minimized,
no further simplification of W = C is attempted. Testing this special case avoids the
potentially expensive simplification of W and makes complementation of weak NBWs efficient.
This is practically relevant because many natural specifications induce weak automata
[11,4]. (In [17] it is shown that the intersection of ACTL and LTL is UCW[1], which is
included in UWW.)
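A weakness test of the kind used for this special case could look as follows. The sketch assumes the common characterization that an automaton is weak when every strongly connected component lies entirely inside or entirely outside α, and it assumes the transition structure is given as a plain successor graph; the names `reachable` and `is_weak` are hypothetical, and the quadratic reachability computation is chosen for brevity rather than speed.

```python
def reachable(graph, src):
    """States reachable from src in one or more steps."""
    seen, stack = set(), [src]
    while stack:
        u = stack.pop()
        for v in graph.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def is_weak(graph, alpha):
    """True iff no cycle contains both an accepting and a non-accepting state."""
    reach = {u: reachable(graph, u) for u in graph}
    for u in graph:
        for v in reach[u]:
            # u and v lie on a common cycle iff each reaches the other.
            if u in reach.get(v, ()) and ((u in alpha) != (v in alpha)):
                return False
    return True
```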

If C is not weak, first its rank is determined, and W is built accordingly, simplifying 
transitions as discussed in Section 5.2, and applying Corollary 1, and Theorems 10-11. 
The states with index 0 are included only if C has at least one transition equal to true. 
(Otherwise, no accepting path can visit them.) Pruning based on language containment 
(see Section 5.3) is then performed as the last optimization of the AWW before computing
the NBW equivalent to W. 

If B is a DBW that is not weak, the resulting AWW is an NWW, and the subset
construction does not change it. In such a case, our algorithm behaves like the one of 
[13]. In some cases, simplification of an AWW also produces an NWW, making the 
subset construction redundant. 

6 Simplification of Nondeterministic Büchi Automata

The complementation algorithm starts and ends with two NBWs, B and N. It is
important to minimize both. For B, every simplification is likely to alleviate the burden
for the successive stages of the computation. For N, minimization recovers inefficiencies
due, in particular, to the subset construction. In this section we describe how this
minimization is carried out. Two procedures are applied to the NBWs B and N. One
is fair simulation minimization [8]. The other is a pruning technique akin to the one
described in Section 5.3, but based on checking direct simulation, rather than language
containment. Its objective is to reduce the height of the NBW, and it works as follows.

1. Mark all states simulated by an initial state as initial.

2. Process MSCCs that intersect α in topological order from sources to sinks.

3. Remove arcs out of MSCC and compute simulation relation for result.

4. If initial states with path to MSCC are simulated by initial states without a path to
the MSCC, make all the states in the MSCC non-accepting.

5. Minimize automaton if some MSCCs were made non-accepting; otherwise, make
non-initial all states that were made initial in the first step.

We rely on the fact that direct simulation minimization removes from the initial states
a state that is simulated by another initial state. Hence, we end up with only one initial 
state if we started with one. 



7 Experimental Results 

We have implemented the complementation algorithm presented in this paper as an 
extension of the Wring translator from LTL to Büchi automata [23,8], which is written
in Perl. All experiments were run on an IBM IntelliStation running Linux with a 1.7 
GHz Pentium 4 CPU and 1 GB of RAM. Complementation experiments were allotted 
1 minute if the input NBW was weak, and 2 minutes if it was not. 

We use a set of 1000 LTL formulas distributed with Wring to evaluate the complementation
algorithm. Two types of comparisons were conducted. In the first, each formula
is converted by Wring into a Büchi automaton whose complement is then computed
if it has exactly one fairness constraint. (Wring produces generalized Büchi automata,
which may have 0, 1, or more sets of accepting states. Our implementation of the
complementation algorithm only deals with one set of accepting states.) The complement is
compared to the automaton obtained by translating the negation of the LTL formula. In
the second comparison, the automaton obtained from an LTL formula is compared to the
complement of its complement. Table 1 summarizes our results with regard to the quality
of the automata produced by the complementation algorithm. For the two experiments,
the table reports the ratios of total numbers of states and transitions produced by the
complementation procedure and those in the reference automata.

Several steps in the translation from LTL to automaton are order dependent. Since
Wring’s data structures heavily rely on hash tables, even minimal differences in two
runs like the addition of a diagnostic print command may cause some differences in the

Table 1. Our complementation procedure produces small automata

experiment                states   trans.
negation                  1.09     1.26
double complementation    1.13     2.23


Table 2. Experimental results

method   weak   timeouts   strong   timeouts   time     states   trans.   N opt.   W
                weak                strong                                ratio    states
base     406    215        67       56         47303    4.08     7.05     6.03     2901
+w       404    4          70       60         9556     5.96     14.03    31.82    4636
+t9      405    4          69       49         7672     6.07     13.67    60.22    2495
+ds      405    4          68       53         10233    5.96     13.36    2.11     2907
+lc      405    3          69       59         9240     6.02     13.52    53.05    2309
-lc      405    4          68       38         6263     6.48     14.93    49.12    3536
-hr      405    3          68       39         6129     6.38     14.71    1.94     3603
-arc     404    4          69       53         6267     5.95     13.36    1.65     2456
all      406    3          68       39         6568     6.02     13.83    6.82     2470


Table 3. Definition of methods compared in Table 2

method   B sim   weak test   Thm. 9   C bound   C rank   W arc   W sim   W lc   N hr   N sim
base     ✓       -           -        -         ✓        -       -       -      ✓      ✓
+w       ✓       ✓           -        -         ✓        -       -       -      ✓      ✓
+t9      ✓       ✓           ✓        -         ✓        -       -       -      ✓      ✓
+ds      ✓       ✓           -        -         ✓        ✓       ✓       -      ✓      ✓
+lc      ✓       ✓           -        -         ✓        -       -       ✓      ✓      ✓
-lc      ✓       ✓           ✓        -         -        ✓       ✓       -      ✓      ✓
-hr      ✓       ✓           ✓        ✓         -        ✓       ✓       -      -      ✓
-arc     ✓       ✓           ✓        -         ✓        -       ✓       ✓      ✓      ✓
all      ✓       ✓           ✓        -         ✓        ✓       ✓       ✓      ✓      ✓


Table 4. Feature description

feature     description                                         section
B sim       fair simulation minimization of B                   6
weak test   simplified treatment for weak B                     5.4
Thm. 9      reduce the number of transitions of W               5.1
C bound     use of 2(n - |α|) as bound for the rank of C        4
C rank      exact computation of the rank of C                  5.3
W arc       simulation minimization of W with accepting arcs    5.2
W sim       direct simulation minimization of W                 5.1
W lc        removal of MSCCs by language containment            5.3
N hr        height reduction of N                               6
N sim       fair simulation minimization of N                   6

results. Hence, the number of automata with one set of accepting states presents small
fluctuations in the various experiments. The same applies to most quantities we report.

Table 2 compares variants of the complementation algorithm ranging from the basic 
procedure presented in [12] (base) to the procedure that implements all the improvements
described in this paper (all). Table 3 defines all variants in terms of their features, and
Table 4 summarizes each feature used to define the methods and refers to the section of 
this paper that discusses it. 

The first column of Table 2 designates the algorithm variant. Columns weak and
timeouts weak report the number of automata from those with one accepting set that
were found to be weak and how many of those timed out. Columns strong and timeouts
strong do the same for the automata that were not weak. The next column gives the
total CPU time in seconds. Columns 7 and 8 give the average number of states and
transitions in N for the cases that completed. For comparison, the average numbers of
states and transitions of the input automaton B are 6.04 and 12.23, respectively. The last
two columns report the average ratio between the size of N before and after optimization
(N opt. ratio), and the total number of states of the AWWs.

A few observations can be made about the data in Table 2. First, checking the input 
automaton B for weakness is a simple way to dramatically improve performance.
However, method +w, which adds this simple check to the base approach, can only complete
10 automata that are not weak. Though there seems to remain considerable room for
improvement in the complementation of automata that are not weak, the optimizations 
presented in this paper triple the number of successes. 
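The success counts behind these remarks follow from the strong and timeouts strong columns of Table 2. The short check below only redoes that arithmetic; the variable names are chosen for illustration.

```python
# (strong, timeouts strong) for three rows of Table 2.
table2 = {"base": (67, 56), "+w": (70, 60), "all": (68, 39)}

# Completed non-weak cases = non-weak automata minus those that timed out.
completed = {m: strong - timeouts for m, (strong, timeouts) in table2.items()}

# Improvement of the full method over the weak-test-only variant.
ratio = completed["all"] / completed["+w"]
```

The full method completes 29 non-weak cases against 10 for +w, which is the roughly threefold improvement mentioned in the text.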

Comparing the average sizes of the automata obtained with the several variants is 
hindered by the fact that the largest automata tend to cause the most timeouts. Comparing 
variants that produce about the same number of timeouts, however, shows that more 
optimization tends to produce smaller automata. It is also instructive to examine the 
effects of optimization of the NBW produced by the subset construction of Theorem 2.

The variants that skip direct simulation minimization of the AWW W have higher N
opt. ratios because the final optimization has to make up for the “sloppiness” of the
preceding stage. While fair simulation minimization of N discharges its duties well,
minimization of W leads to a more robust solution.

Acknowledgment. Doron Bustan called our attention to the improved bound on the 
rank of a UCW. 

Coverage Metrics for Formal Verification

Abstract. In formal verification, we verify that a system is correct with respect
to a specification. Even when the system is proven to be correct, there is still a 
question of how complete the specification is, and whether it really covers all 
the behaviors of the system. The challenge of making the verification process 
as exhaustive as possible is even more crucial in simulation-based verification, 
where the infeasible task of checking all input sequences is replaced by checking 
a test suite consisting of a finite subset of them. It is very important to measure the 
exhaustiveness of the test suite, and indeed, there has been extensive research in
the simulation-based verification community on coverage metrics, which provide 
such a measure. It turns out that no single measure can be absolute, leading to 
the development of numerous coverage metrics whose usage is determined by 
industrial verification methodologies. On the other hand, prior research of coverage 
in formal verification has focused solely on state-based coverage. In this paper we 
adapt the work done on coverage in simulation-based verification to the formal- 
verification setting in order to obtain new coverage metrics. Thus, for each of the 
metrics used in simulation-based verification, we present a corresponding metric 
that is suitable for the setting of formal verification, and describe an algorithmic 
way to check it. 



1 Introduction 

Today’s rapid development of complex hardware designs requires reliable verification 
methods. In formal verification, we verify the correctness of a design with respect to a
desired behavior by checking whether a labeled state-transition graph that models the 
design satisfies a specification of this behavior, expressed in terms of a temporal logic 
formula or a finite automaton [CGP99]. Beyond being fully-automatic and reliable, an 
additional attraction of formal-verification tools is their ability to accompany a negative 
answer to the correctness query by a counterexample to the satisfaction of the specifi- 
cation in the design [CGMZ95]. On the other hand, when the answer to the correctness 
query is positive, most formal-verification tools terminate with no further information 
to the user. Since a positive answer means that the design is correct with respect to the
specification, this seems like a reasonable policy. In the last few years, however, there 
has been growing awareness of the importance of suspecting the design of containing 
an error also in the case verification succeeds. The main justification for such suspi- 
cion are possible errors in the modeling of the design or of the behavior, and possible 
incompleteness in the specification. 

Several sanity checks have been suggested for further assessment of the modeling of 
the design and the specification [Kur97]. One direction is to detect vacuous satisfaction
of the specification [BBER01,KV03,PS02], where cases like antecedent failure [BB94] 
make parts of the specification irrelevant to its satisfaction. For example, the specifica- 
tion “every request is eventually granted” is vacuously satisfied in a design in which no 
requests are sent. A similar direction is to check the validity of the specification (a speci- 
fication is valid if it holds for all designs). Clearly, vacuity or validity of the specification 
suggests some problem. It is less clear how to check completeness of the specification. 
Indeed, specifications are written manually, and their completeness depends entirely 
on the competence of the person who writes them. The motivation for a completeness 
check is clear: an erroneous behavior of the design can escape the verification efforts if 
this behavior is not captured by the specification. In fact, it is likely that a behavior not 
captured by the specification also escapes the attention of the designer, who is often the
one to provide the specification. 

The challenge of making the verification process as exhaustive as possible is even 
more crucial in simulation-based verification. Each input vector for the design induces 
a different execution of it, and a design is correct if it behaves as required for all possible 
input vectors. Checking all the executions of a design is an infeasible task. Simulation- 
based verification is traditionally used in order to check the design with respect to 
some input vectors [BF00]. The vectors are chosen so that the verification would be
as exhaustive as possible, but still, design errors may escape the verification process.
Since simulation-based verification is a heuristic that replaces the infeasible task of
checking all input vectors, it is very important to measure the exhaustiveness of the
input sequences that are checked. Indeed, there has been extensive research in the
simulation-based verification community on coverage metrics, which provide such a
measure [TK01]. Coverage metrics are used in order to monitor progress of the verifica-
tion process, estimate whether more input sequences are needed, and direct simulation 
towards unexplored areas of the design. Essentially, the metrics measure the part of the 
design that has been activated by the input sequences. For example, in code-based cov- 
erage metrics, the design is given as a program in some hardware description language 
(HDL), and one measures the number of code lines executed during the simulation. In 
Section 3, we survey the variety of metrics that are used in simulation-based verification 
(see also [ZHM97,Dil98,Pel01,TK01]). Coverage metrics today play an important role 
in the design validation effort [Ver03]. 

Measuring the exhaustiveness of a specification in formal verification (“do more 
properties need to be checked?”) has a similar flavor to measuring the exhaustiveness
of the input sequences in simulation-based verification (“do more sequences need to
be checked?”). Nevertheless, while for simulation-based verification it is clear that
coverage corresponds to activation during the execution on the input sequence, it is less
clear what coverage should correspond to in formal verification, as in model checking 
all reachable parts of the design are visited. Early work on coverage metrics in formal 
verification [HKHZ99,KGG99] suggested two directions. Both directions reason about 
a finite-state machine (FSM) that models the design. The metric in [HKHZ99], later 
followed by [CKV01,CKKV01,CK02], is based on mutations applied to the FSM. Es- 
sentially, a state s in the FSM is covered by the specification if modifying the value of a 
variable in the state renders the specification untrue. The metric in [KGG99] is based on a 
comparison between the FSM and a reduced tableau for the specification. See [CKV01]
for a discussion of pros and cons of this metric. 

Coming up with an exhaustive specification is of great importance and challenge in 
formal verification. Sanity checks have been helpful in detecting design errors that es- 
cape the verification process [BBER01,HKHZ99,PS02]. The main lesson to be learned 
from several years of research in coverage in simulation-based verification [Pel01,TK01]
is that coverage is a heuristic that measures the exhaustiveness of the verification effort, 
but no single measure can be absolute. Consequently, research in simulation-based cov- 
erage has identified numerous coverage metrics; their usage is determined by practical 
verification methodologies. Prior research of coverage in formal verification [HKHZ99, 
KGG99,CKV01,CKKV01,CK02] has focused solely on state-based coverage. In con- 
trast, in simulation-based coverage one finds many other coverage metrics, including 
several metrics of code coverage, which measure that all syntactic aspects of the design 
have been covered [Pel01,TK01]. Our goal in this paper is to adapt the work done on 
coverage in simulation-based verification to the formal-verification setting in order to 
obtain new coverage metrics. Thus, for each of the metrics used in simulation-based 
verification, we present a corresponding metric that is suitable for the setting of formal 
verification. In addition, we describe symbolic algorithms for computing each of the 
new metrics. 

The adoption of metrics from simulation-based verification is not straightforward. To 
see this, consider for example code-based coverage and a check whether both branches 
of an if statement have been executed during the simulation. A straightforward adoption 
would check the satisfaction of the specification in a mutant design, one for each branch, 
in which the branch is disabled. Such a mutant design, however, has fewer behaviors than
the original design, and would clearly satisfy all universal specifications (i.e., specifi- 
cations that apply to all behaviors, as in linear temporal logic) that are satisfied by the 
original design. In general, the problem we are facing is the need to assess the role a 
behavior has played in the satisfaction of a universal specification - one that is clearly 
satisfied in the design obtained by removing this behavior. The way we suggest to do so 
is to check whether the specification is vacuously satisfied in a mutant design in which 
this behavior is disabled: a vacuous satisfaction of the specification in such a design (we 
assume that the specification is not vacuously satisfied in the original design) indicates 
that the specification does refer to this behavior; on the other hand, a non-vacuous satisfaction
of the specification in the mutant design indicates that the specification does not
refer to the missing behavior. Accordingly, some of the new metrics we suggest reduce 
coverage to queries about vacuous satisfaction. On the other hand, a code-based metric 
that checks whether a particular assignment in the code has been executed may also 
be reduced to a metric that checks the satisfaction of the specification in a mutant design
in which the assignment is changed. Accordingly, some of the metrics we suggest
follow the approach in [HKHZ99] and reduce coverage to queries about satisfaction 
of the specification in mutant designs. Unlike previous work, however, the mutant de- 
signs we consider are not arbitrary, and capture the different metrics of coverage used 
in simulation-based verification. 

Due to lack of space, this version omits many technical details. A fuller version can
be found at the authors’ URLs. 

2 Preliminaries 

2.1 Simulation-Based Verification 

In simulation-based verification, the implementation of a hardware design is executed 
in parallel with a reference model described at a different level of abstraction or with 
monitors and assertions that check for certain behavior of the implementation [KN96]. 
The execution is done with respect to a selected set of finite input sequences, referred to 
as tests. Thus, assuming the implementation has a set $I$ of input signals, a test is a finite
sequence $t = i_0, i_1, \ldots, i_n \in (2^I)^*$ of input assignments. Implementations of hardware
designs can be described by different formalisms. We consider two formalisms with 
respect to which coverage metrics are naturally defined. 

The first formalism is that of hardware description languages (HDL). A typical 
HDL program specifies the input and output variables of the various modules of the 
design, and, using control and assignment statements, the interaction of the modules 
among themselves and with an environment that provides the input signals. Reasoning
about a rich HDL such as Verilog involves difficult technical details.* We consider here
the simplified model of a control flow graph (CFG). Each HDL statement corresponds
to a control state and induces a node in the CFG. We refer to CFG nodes as locations. 
Assignment statements have a single successor, and control statements, such as if or 
while, have several successors, corresponding to the possible locations to which the 
control can jump. Transitions from a control statement to its successors are labeled by an 
expression that guards the transition. Recall that the design interacts with an environment 
that supplies its input signals. When the design is described as a CFG, the interaction 
induces a traversal of the CFG. Formally, given a CFG $G$ with a set $V$ of locations,
and a test $t = i_0, \ldots, i_n \in (2^I)^{n+1}$ of input assignments, the execution of $G$ on $t$ is
a sequence $(i_0, (l_0^0, \ldots, l_{k_0}^0), o_0), \ldots, (i_n, (l_0^n, \ldots, l_{k_n}^n), o_n) \in (2^I \times V^+ \times 2^O)^{n+1}$
such that $l_0^0$ is the initial location of $G$, for all $0 \le j \le n$, the location $l_{k_j}^j$
corresponds to a read and write assertion, $o_j$ is the new assignment to the output variables,
and $l_0^{j+1}$ matches the control flow of the CFG from location $l_{k_j}^j$ upon reading $i_{j+1}$. The
locations $l_1^{j+1}, \ldots, l_{k_{j+1}}^{j+1}$ then correspond to the control flow of the CFG from $l_0^{j+1}$ until
the next input assignment is read. We often ignore the input and output variables and
refer to the interaction as a word in $V^*$ obtained by projecting the execution above on $V$.

The second formalism is that of sequential circuits. We refer to a circuit as a tuple
$S = (I, O, L, \mathcal{F}, \mathcal{G}, c_0)$, where $I$ and $O$ are the sets of input and output signals, respectively,
$L$ is a set of latches, $c_0 \in 2^L$ describes the initial values of the latches, and $\mathcal{F}$ and $\mathcal{G}$ are
families of the next-state and output functions. Thus, each latch $l \in L$ has a function
$f_l : 2^I \times 2^L \to \{0, 1\}$ in $\mathcal{F}$, and each output signal $o \in O$ has a function
$g_o : 2^I \times 2^L \to \{0, 1\}$ in $\mathcal{G}$. A configuration $c \in 2^L$ of the circuit describes the value of each latch. The
circuit starts its interaction with the environment in configuration $c_0$. When the circuit is
in configuration $c$ and it reads a set $i \in 2^I$ of input signals, it moves to configuration $c'$ in
which the value of each latch $l$ is $f_l(i, c)$, and in which it sends to the environment the set
of output signals $o$ with $g_o(i, c) = 1$. Accordingly, the execution of a circuit $S$ on a test
$t = i_0, \ldots, i_n \in (2^I)^{n+1}$ is a sequence $(i_0, c_0, o_0), (i_1, c_1, o_1), \ldots, (i_n, c_n, o_n) \in (2^I \times 2^L \times 2^O)^{n+1}$
that satisfies the conditions above.

* For a description of a formal model of a real-life HDL see, for example, [FLLO95].
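
The execution of a circuit on a test can be illustrated with a short Python sketch. The two-latch counter, the signal names (`inc`, `b0`, `b1`, `carry`) and the helper `execute` are hypothetical and are not taken from the paper. Sets of signals are represented as frozensets of names, so a configuration is the set of latches whose value is 1. The sketch assumes that the triple at position j pairs the input $i_j$ with the configuration $c_j$ in which it is read and with the outputs produced by $g_o(i_j, c_j)$.

```python
def execute(latch_fns, output_fns, c0, test):
    """Return the execution [(i_0, c_0, o_0), ..., (i_n, c_n, o_n)]."""
    trace, c = [], frozenset(c0)
    for i in test:
        i = frozenset(i)
        # Outputs sent while reading i in configuration c.
        o = frozenset(name for name, g in output_fns.items() if g(i, c))
        trace.append((i, c, o))
        # Next configuration: latches whose next-state function yields 1.
        c = frozenset(name for name, f in latch_fns.items() if f(i, c))
    return trace


# Hypothetical 2-bit counter that counts the cycles in which "inc" is high.
latch_fns = {
    "b0": lambda i, c: ("b0" in c) != ("inc" in i),
    "b1": lambda i, c: ("b1" in c) != ("inc" in i and "b0" in c),
}
output_fns = {
    "carry": lambda i, c: "inc" in i and "b0" in c and "b1" in c,
}

test = [{"inc"}, {"inc"}, set(), {"inc"}, {"inc"}]
trace = execute(latch_fns, output_fns, set(), test)
configs = [sorted(c) for _, c, _ in trace]
outputs = [sorted(o) for _, _, o in trace]
```

In this run the counter passes through the configurations {}, {b0}, {b1}, {b1}, {b0, b1}, and `carry` is raised only in the last step.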

Both HDL and circuits enable a description of the design at different levels of abstrac- 
tion [Hos95], yet abstraction is most naturally supported when the design is modeled as 
a symbolic finite state machine (FSM). We assume that the design is defined with respect 
to a set $X$ of state variables, and it is specified by predicates on $X$ and $X'$, a primed
version of the variables in $X$. Formally, an FSM is a tuple $F = (I, O, X, \theta_{in}, \theta_{next}, \mathcal{G})$,
where $I$ and $O$ are the input and output variables, $X$ is the set of state variables, inducing
the state space $2^X$, $\theta_{in}$ is a predicate on $X$ describing the set of initial states, $\theta_{next}$ is a
predicate on $X \cup X'$ describing the transition relation (there is a transition from state
$u$ to state $v$ iff $\theta_{next}(u, v')$), and $\mathcal{G}$ is a family of predicates that associates with each
input or output variable $s$ a predicate $\gamma_s$ on $X$ describing the set of states in which $s$
holds. Likewise, predicates on $X$ are used to describe other sets of interest, for example,
the set of fair states when the design comes with an unconditional fairness constraint.
Formally, a fair FSM $F$ is a tuple $F = (I, O, X, \theta_{in}, \theta_{next}, \mathcal{G}, \alpha)$, where $\alpha$ is a predicate
on $X$ describing the accepting condition. A behavior $\pi$ is accepted by $F$ if it satisfies
$\alpha$. The simplest accepting condition is the Büchi condition [Büc62] (called impartiality in
[MP92]), where $\alpha$ is a predicate on $X$ and a behavior $\pi$ satisfies $\alpha$ if it visits a state
satisfying $\alpha$ infinitely often.

2.2 Model Checking, Vacuity, and Coverage 

In linear-time model checking, we check whether a design has a desired behavior by
checking whether a Büchi automaton for the negation of the specification has accepting
runs on an FSM describing the design [VW86]. The specification can be expressed as
an LTL formula [Hol97], as a ForSpec formula [AFG+02], or as a Büchi automaton
[HHK96,Kur98]. A specification $\varphi$ in linear temporal logic can be translated to a
nondeterministic Büchi automaton $A_{\neg\varphi}$ that accepts all words that do not satisfy $\varphi$ [VW94].
Given an automaton $A_{\neg\varphi}$, we check that the product of $F$ with $A_{\neg\varphi}$, which is a fair
FSM $F \times A_{\neg\varphi}$, does not contain accepting paths.

Sanity checks for model checking address the problem of errors in the modeling of 
the design and the desired behavior, which are not discovered by model checking. These 
problems may cause “false positive” results of model checking and conceal errors in the 
design. Two such checks are vacuity and coverage, which we briefly review below (for 
the full details, see [BBER01,KV03,HKHZ99,CKV01]). 

Intuitively, an FSM $F$ satisfies a formula $\varphi$ vacuously if $F$ satisfies $\varphi$ yet it does
so in a non-interesting way, which is likely to point to some trouble with either $F$ or
$\varphi$. In order to formalize this intuition, we first say that a subformula $\psi$ of $\varphi$ does not
affect $\varphi$ in $F$ if for every formula $\xi$, the FSM $F$ satisfies $\varphi[\psi \leftarrow \xi]$ iff $F$ satisfies $\varphi$,
where $\varphi[\psi \leftarrow \xi]$ denotes the formula obtained from $\varphi$ by replacing $\psi$ with $\xi$ [BBER01].
As shown in [KV03], when $\psi$ has a single occurrence in $\varphi$, then instead of checking
the replacement of $\psi$ by all formulas $\xi$, one can check only the replacement of $\psi$ by
the formulas true and false. Thus, $\psi$ does not affect $\varphi$ in $F$ whenever $F$ satisfies
$\varphi[\psi \leftarrow true]$ iff $F$ satisfies $\varphi[\psi \leftarrow false]$. Now, an FSM $F$ satisfies a formula $\varphi$
vacuously iff $F \models \varphi$ and there is some subformula $\psi$ of $\varphi$ such that $\psi$ does not affect $\varphi$
in $F$. Equivalently, $F$ satisfies $\varphi$ vacuously if $F \models \varphi$ and there is some subformula $\psi$
of $\varphi$ such that $F$ also satisfies $\varphi[\psi \leftarrow \bot]$, where $\bot$ is either false or true, depending
on the polarity of $\psi$ in $\varphi$. It is easy to see that vacuous satisfaction can be detected by
a naive algorithm that model checks $F$ with respect to formulas obtained from $\varphi$. More
sophisticated algorithms are suggested in [PS02,KV03,Cho03,AFF+03].
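
The naive vacuity check can be sketched in Python under strong simplifying assumptions: the FSM is given explicitly, and the specification is an invariant of the form G(req -> ack), so that model checking reduces to inspecting the reachable states. The FSM, the signal names and all helper names below are hypothetical. The two mutant formulas are written by hand: `ack` occurs with positive polarity and is replaced by false, while `req` occurs with negative polarity and is replaced by true.

```python
def reachable(init, succ):
    """All states reachable from the initial states (explicit search)."""
    seen, stack = set(init), list(init)
    while stack:
        for v in succ[stack.pop()]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def holds_globally(init, succ, label, pred):
    """Model check the invariant G pred by visiting every reachable state."""
    return all(pred(label[s]) for s in reachable(init, succ))


def vacuous_subformulas(init, succ, label, spec, mutants):
    """Names of subformulas whose replacement by bottom keeps the spec true."""
    assert holds_globally(init, succ, label, spec), "spec must hold in F"
    return [name for name, mutant in mutants.items()
            if holds_globally(init, succ, label, mutant)]


# Hypothetical FSM in which "req" is never raised.
succ = {0: [1], 1: [0]}
label = {0: set(), 1: {"ack"}}
spec = lambda l: ("req" not in l) or ("ack" in l)   # G(req -> ack)
mutants = {
    "ack": lambda l: "req" not in l,   # ack <- false gives G(not req)
    "req": lambda l: "ack" in l,       # req <- true  gives G(ack)
}
result = vacuous_subformulas([0], succ, label, spec, mutants)
```

Here the specification holds only because no request is ever issued, so replacing `ack` by false does not change the outcome and the satisfaction is vacuous.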

Coverage in model checking was introduced in [HKHZ99,KGG99]. The metric in
[HKHZ99] is based on FSM mutations. For an FSM $F = (I, O, X, \theta_{in}, \theta_{next}, \mathcal{G})$, a
state $w \in 2^X$ and an output variable $q \in O$, a mutant FSM $\tilde{F}_{w,q}$ is obtained from $F$
by dualizing the value of $q$ in the state $w$. Thus, if $\gamma_q$ is the predicate describing the
set of states satisfying $q$ in $F$, then the predicate $\tilde{\gamma}_{w,q}$, which describes the set of states
satisfying $q$ in $\tilde{F}_{w,q}$, is satisfied by $w$ iff $\gamma_q$ is not satisfied by $w$. For all states $v \neq w$,
the predicate $\tilde{\gamma}_{w,q}$ is satisfied by $v$ iff $\gamma_q$ is satisfied by $v$. For an FSM $F$, a specification
$\varphi$ that is satisfied in $F$, and an output variable $q$, we say that $\varphi$ $q$-covers $w$ iff $\tilde{F}_{w,q}$ no
longer satisfies $\varphi$. By [HKHZ99], a state is covered if it is $q$-covered for some output
variable $q$. It is easy to see that the set of states $q$-covered by $\varphi$ can be computed by a
naive algorithm that performs model checking of $\varphi$ in $\tilde{F}_{w,q}$ for each state $w$ of $F$. More
sophisticated algorithms are suggested in [HKHZ99,CKV01,CKKV01].
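
The naive algorithm for computing the q-covered states can be sketched as follows. This is only an illustration under simplifying assumptions: the FSM is explicit, and the specification is an invariant, so each model-checking call is a pass over the reachable states. The example FSM and the names `satisfies` and `q_covered_states` are hypothetical.

```python
def reachable(init, succ):
    """All states reachable from the initial states (explicit search)."""
    seen, stack = set(init), list(init)
    while stack:
        for v in succ[stack.pop()]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def satisfies(init, succ, label, pred):
    """Check the invariant G pred on all reachable states."""
    return all(pred(label[s]) for s in reachable(init, succ))


def q_covered_states(init, succ, label, pred, q):
    """States w such that dualizing q in w falsifies the specification."""
    covered = set()
    for w in succ:
        mutant = dict(label)
        mutant[w] = label[w] ^ {q}      # flip q in w, keep all other states
        if not satisfies(init, succ, mutant, pred):
            covered.add(w)
    return covered


# Hypothetical FSM and specification G(req -> ack).
succ = {0: [1], 1: [2], 2: [0]}
label = {0: set(), 1: {"req", "ack"}, 2: {"ack"}}
spec = lambda l: ("req" not in l) or ("ack" in l)
covered = q_covered_states([0], succ, label, spec, "ack")
```

Only state 1 is `ack`-covered: it is the only state in which a request is pending, so it is the only state where the value of `ack` matters for the specification.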

Chockler et al. also suggest the following refinement of coverage metrics [CKKV01].
Instead of performing local mutations in $F$, we can perform local mutations in the infinite
tree $T_F$ obtained by unwinding $F$. A state $w$ of $F$ can appear many (possibly an infinite
number of) times in $T_F$. Flipping the value of $q$ in one occurrence of $w$ in $T_F$ can have
a different effect from flipping the value of $q$ in all or some of the occurrences of $w$ in
$T_F$. These differences are captured by the notions of node, structure, and tree coverage.
Node coverage of a state $w$ corresponds to flipping the value of $q$ in one occurrence of
$w$ in the infinite tree. Structure coverage corresponds to flipping the value of $q$ in all
the occurrences of $w$ in the tree. Chockler et al. describe a framework in which node,
structure, and tree coverage can be computed by a symbolic algorithm; minor changes
are required to capture the different types of coverage [CKKV01]. We describe their
algorithm in more detail in Section 5.

In this paper we introduce new types of mutations and new types of coverage metrics 
in model checking in order to capture better the different notions of coverage used in 
simulation-based verification. Coverage in model checking is performed by applying 
mutations to a given FSM and then examining the resulting mutant FSMs with respect 
to a given specification. Each mutation is generated in order to check whether a specific 
element of the design is essential for the satisfaction of the specification. As we explain 
in more detail in Section 4, mutations correspond to omissions and replacements of small 
elements of the design, which can be given as an HDL program, an FSM, or a sequential 
circuit. Once we have a mutant FSM, there are two coverage checks we can perform
on it.

1. Falsity coverage: does the mutant FSM still satisfy the specification?

2. Vacuity coverage: if the mutant FSM still satisfies the specification, does it satisfy
it vacuously?

Falsity coverage is the metric introduced in [HKHZ99], and we extend it here to
handle mutations richer than those studied in the literature so far. Vacuity coverage is
new. As we demonstrate in Example 1, it often provides information that falsity coverage
fails to detect. In particular, in mutations that are based on omission of elements from
the original design (as we are going to see in Section 4, such mutations are popular
in metrics adopted from simulation-based verification), falsity coverage is useless for
universal specifications. Indeed, having fewer behaviors, the mutant design is guaranteed
to satisfy all the specifications satisfied by the original design.

Example 1. Consider the FSM $F$ described below, which abstracts a design with respect
to the output signals $grant_1$ and $grant_2$. Let $\varphi = G(grant_1 \to F\, grant_2)$. Thus,
$\varphi$ requires that (in all execution paths) each grant
to the first user is followed by a grant to the second
user. It is easy to see that $\varphi$ is satisfied in $F$. Recall
that the goal of coverage metrics is to check whether
all the elements of the design play some role in the
satisfaction of $\varphi$. Let us see which parts of $F$ are
covered by $\varphi$. We refer only to structure coverage
in this example.

• The positive value of $grant_2$ in $w_4$ is essential to the satisfaction of $\varphi$; the state is
falsity covered by $\varphi$ with respect to mutations that flip the value of $grant_2$.

• The value of $grant_1$ in $w_1$ is not essential to the satisfaction of $\varphi$. On the other hand,
the designer had a reason to set it to true in $w_1$, as it is essential to the non-vacuous
satisfaction of $\varphi$; the state $w_1$ is vacuity covered by $\varphi$ with respect to mutations in which
$w_1$ is omitted and with respect to mutations that flip the value of $grant_1$.

• One may also question negative values of variables. For example, while the negative
value of $grant_2$ in $w_0$ is not essential to the satisfaction of $\varphi$, it is essential to its
non-vacuous satisfaction: the state $w_0$ is vacuity covered by $\varphi$ with respect to mutations that
flip the value of $grant_2$.

• Consider now the value of $grant_2$ in the state $w_2$. All the paths of $F$ that pass through
$w_2$ describe a behavior in which two grants, in both $w_2$ and $w_4$, are given to the
second user, after at most one grant was given to the first user. The specification does
not require such a behavior, nor does it require a correspondence between the number
of grants that each user gets. The labeling of $w_2$ indeed does not play a role in the
satisfaction of $\varphi$: the state $w_2$ is neither falsity nor vacuity covered by $\varphi$ with respect
to mutations that omit $w_2$ or flip the value of $grant_2$. This information may hint at a
possible imprecision or incompleteness in the definition of $\varphi$.

3 Coverage Metrics in Simulation-Based Verification 

In this section we survey coverage metrics in simulation-based verification, metrics we
are going to adopt for the setting of formal verification in the next section. Each of the
metrics is “tailored” for a specific representation of the design or a specific verification
goal. The reader is referred to [TK01] for a detailed survey. All metrics refer to a set of
input sequences (or tests) $t \in (2^I)^*$ with respect to which the design is simulated.

3.1 Syntactic Coverage Metrics 

Syntactic coverage metrics assume a specific formalism for the description of the design
and measure the syntactic part of the design visited in the process of execution of a given
input sequence. Commonly [Mar99,TK01], high coverage according to syntactic-based
metrics is considered a precondition to moving to other more sophisticated (and time
consuming) coverage metrics.

Code Coverage. Code-based coverage metrics refer to the HDL program that describes
the design or to its CFG. Measuring code coverage requires little overhead and it is easy
to interpret the coverage information. This makes code coverage the most popular metric
[UZ98,TK01]. The most widely used code-coverage metrics are statement and branch
coverage. Essentially, an object is covered if it is visited during the execution of the input
sequence. Again, the fully-formal definition depends on the particular HDL used, but a
semi-formal definition is given in terms of the computation of the CFG as follows. Let $G$
be a CFG. For an input sequence $t \in (2^I)^*$ such that the execution of $G$ on $t$, projected on
the sequence of locations, is $l_0, \ldots, l_m$, we say that a statement $r$ is covered by $t$ if there
is $0 \le j \le m$ such that the control location $l_j$ corresponds to $r$. We say that a branch
$\langle l, l' \rangle$ between two control locations is covered by $t$ if there is $0 \le j \le m - 1$ such that
$l_j = l$ and $l_{j+1} = l'$. More sophisticated metrics measure the way expressions in the
guards labeling the CFG’s transitions are satisfied. For example, expression coverage
checks whether a Boolean expression has been satisfied by all its satisfying assignments
(e.g., whether $o_1 == o_2$ has been satisfied by both an $o_1 = o_2 = 0$ and an $o_1 = o_2 = 1$
assignment).
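
The semi-formal definitions of statement and branch coverage translate directly into code. The following Python sketch takes the projections of executions on the sequence of locations and reports which locations and branches were visited. The location names and the sample run are hypothetical.

```python
def statement_coverage(locations, runs):
    """Locations l for which some run visits l."""
    visited = {l for run in runs for l in run}
    return visited & set(locations)


def branch_coverage(branches, runs):
    """Branches (l, l') that appear as consecutive locations in some run."""
    taken = {(a, b) for run in runs for a, b in zip(run, run[1:])}
    return taken & set(branches)


# Hypothetical CFG with one if statement.
locations = ["read", "if", "then", "else", "write"]
branches = [("if", "then"), ("if", "else")]

# Projection on locations of one execution that takes the then-branch twice.
runs = [["read", "if", "then", "write", "read", "if", "then", "write"]]

uncovered_statements = set(locations) - statement_coverage(locations, runs)
uncovered_branches = set(branches) - branch_coverage(branches, runs)
```

For this run the `else` statement and the branch from `if` to `else` remain uncovered, which suggests adding a test that falsifies the guard.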

Circuit Coverage. Circuit-structure based coverage metrics refer to the circuit that describes
the design. Thus they identify the physical parts of the circuit that are covered.
Measuring circuit coverage is usually easy and it is easy to interpret the coverage information.
Unlike code coverage, however, it is not easy to use the coverage information
in order to generate new tests that direct simulation towards the unexplored areas of the
design. The most widely used circuit-coverage metrics are latch and toggle coverage
[HH96,KN96]. Essentially, a latch is covered if it changes its value at least once during
the execution of the input sequence. Similarly, an output variable is covered if its value
has been toggled. Formally, for a circuit $S$ and an input sequence $t \in (2^I)^{n+1}$ such that
the execution of $S$ on $t$ is $(i_0, c_0, o_0), (i_1, c_1, o_1), \ldots, (i_n, c_n, o_n)$, we say that a latch
$l \in L$ is covered by $t$ if there is $j > 0$ such that $l \in c_0$ iff $l \notin c_j$. Similarly, an output
variable $o \in O$ is covered by $t$ if there are $0 < j_1 < j_2$ such that $o \in o_0$ iff $o \notin o_{j_1}$ iff
$o \in o_{j_2}$. Note that toggle coverage requires that the value of an output variable should
be changed at least twice during the execution of $t$.
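
The two definitions can be checked on a recorded execution with a few lines of Python. The sketch below assumes that configurations and output sets are given as Python sets of signal names; the function names and the sample traces are hypothetical.

```python
def latch_covered(latch, configs):
    """configs = [c_0, ..., c_n]; covered iff some c_j, j > 0, differs
    from c_0 on the value of the latch."""
    initial = latch in configs[0]
    return any((latch in c) != initial for c in configs[1:])


def toggle_covered(out, outputs):
    """outputs = [o_0, ..., o_n]; covered iff the value of out leaves its
    initial value at some j1 > 0 and returns to it at some j2 > j1."""
    initial = out in outputs[0]
    changed = False
    for o in outputs[1:]:
        if not changed:
            changed = (out in o) != initial
        elif (out in o) == initial:
            return True
    return False


configs = [set(), {"b0"}, {"b1"}]
outputs_once = [set(), {"carry"}, {"carry"}]    # value changes only once
outputs_twice = [set(), {"carry"}, set()]       # changes and changes back
```

Latches `b0` and `b1` are covered by this trace, while a latch that never leaves its initial value is not. The output `carry` is toggle-covered only by the second output sequence.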

3.2 Semantic Coverage Metrics

Semantic coverage metrics measure the part of the functionality of the design exercised 
by the set of input sequences. Semantic coverage metrics require user help and are more 
sophisticated than syntactic coverage metrics. We consider the following metrics. 



FSM Coverage. Due to the large size of FSMs for complete systems, FSM-based coverage
metrics refer to more abstract FSMs constructed manually by the designer, or
automatically extracted from the design by projecting its symbolic description on a subset
of the state variables as explained in Section 2.1 [TK01]. Similarly to code coverage,
a state or a transition of the abstract FSM is covered if it is visited during the execution
of the input sequence. The fact that coverage is checked with respect to an abstract FSM
makes the interpretation of the coverage information harder (linking the uncovered parts
of the FSM to uncovered parts of the HDL program is not trivial) and has led to the
use of more sophisticated metrics. In particular, limited-path coverage metrics check
that important sequences of behavior are exercised [SA99]. Transition coverage can be
viewed as a special case of path coverage, for paths of length 1.



Assertion Coverage. In assertion coverage (“functional coverage”, in [TK01,Cad03]), 
the user provides a list of assertions referring to the variables of the design. The assertions 
describe some conditions that may be satisfied during the execution or a state of the 
design during the execution. They may be propositional (“snapshot tasks”) or temporal 
(describing a behavior along several clock cycles). A test t covers an assertion a if the 
execution of the design on t satisfies a. The assertion-coverage metric measures what 
assertions are covered by a given set of input sequences. 



Mutation Coverage. In mutation coverage, the user introduces a small change (aka 
“mutation”) to the design, and checks whether the change leads to an erroneous behavior 
[DLS78,Bud81,ZHM97]. The coverage of a test t is measured as the percentage of the 
mutant designs that fail on t, that is, the percentage of the mutations that t “catches”. The 
list of interesting mutations can be written manually or automatically following some 
mutation criteria. For example, a local mutation can be flipping a value of one output 
variable in a circuit. In mutation coverage the goal is to find a set of input sequences such 
that for each mutant design there exists at least one test that fails on it. As discussed in 
Section 2.2, mutation coverage is the metric that inspired most of the work on coverage 
in model checking. 
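
Mutation coverage in simulation can be sketched as follows. The design is modeled, as a simplifying assumption, by a function from a test to the sequence of produced outputs, and a mutant fails on a test when its outputs differ from those of the reference design. The running-parity design and the three mutants are hypothetical.

```python
from itertools import accumulate
from operator import xor


def reference(test):
    """Hypothetical design: output j is the parity of inputs 0..j."""
    return list(accumulate(test, xor))


# Hypothetical mutants, each a small change of the reference design.
mutants = {
    "stuck_at_0": lambda t: [0] * len(t),
    "inverted": lambda t: [1 - b for b in reference(t)],
    "ignores_first": lambda t: reference([0] + list(t[1:])),
}


def mutation_coverage(design, mutants, test):
    """Names of the mutants that fail on the test, and their percentage."""
    expected = design(test)
    caught = [name for name, m in mutants.items() if m(test) != expected]
    return caught, 100.0 * len(caught) / len(mutants)


weak_caught, weak_pct = mutation_coverage(reference, mutants, [0, 0])
strong_caught, strong_pct = mutation_coverage(reference, mutants, [1, 0])
```

The all-zero test catches only the inverted mutant, whereas the test that starts with a 1 catches all three, which illustrates why a set of tests is chosen so that every mutant fails on at least one of them.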



4 Coverage Metrics in Model Checking 



In this section we discuss how the coverage metrics from simulation-based verification
can be adopted in model checking. Thus, for each of the metrics described in Section 3,
we define a metric that can be used in the context of model checking.

4.1 Syntactic Coverage

In syntactic coverage, we assume that we are given the syntactic representation of the 
design (an HDL code or a CFG) with respect to which we measure the coverage. Since 
in the process of model checking we visit the whole reachable part of the design, metrics 
that measure the part of the design exercised during the simulation cannot be applied 
directly to model checking. Essentially, we adopt these metrics by replacing the question 
whether a part of the design has been visited during the simulation by the question 
whether the part plays a role in the success of the verification process, where playing a 
role means that the part is essential for the satisfaction or the non-vacuous satisfaction 
of the specification. The latter is checked by reasoning about the behavior of a mutant
design in which the part is modified or omitted.

Code Coverage. Let $G$ be a CFG and $\varphi$ a specification that is satisfied in $G$. We say
that a statement $r$ of $G$ is covered by $\varphi$ if omitting $r$ from $G$ causes vacuous satisfaction
of $\varphi$ in the mutant CFG. Similarly, a branch $\langle l, l' \rangle$ of $G$ is covered if omitting it causes
vacuous satisfaction of $\varphi$. Note that falsity coverage would be meaningless here, since
omitting a statement or a branch of the CFG results in a design with fewer behaviors,
which is guaranteed to satisfy the universal specification. In expression coverage, we
check whether omitting the behaviors in which the variables have a particular satisfying
assignment for a particular expression leads to vacuous satisfaction of $\varphi$.

Circuit Coverage. Recall that latch and toggle coverage metrics check whether the
value of a specific latch or variable in the circuit changes during the execution of an
input sequence. We replace this question by the question whether disabling the change
causes the specification to be satisfied vacuously. Thus, a latch $l \in L$ is covered if the
specification is vacuously satisfied in the circuit obtained by fixing the value of $l$ to
its initial value. Similarly, an output variable $o \in O$ is covered if the specification is
vacuously satisfied in the circuit obtained by allowing $o$ to change its value only once.
Thus, if the initial value of $o$ is 0, the circuit is obtained by fixing $o$ to 1 as soon as it
changes its value to 1, and if the initial value of $o$ is 1, the circuit is obtained by fixing $o$
to 0 as soon as it changes its value to 0.

4.2 Semantic Coverage 

Among the semantic coverage metrics, mutation coverage has already been adapted to
the setting of model checking. As discussed in Section 2.2, we suggest a strengthening
of the adopted metrics by checking the effect of the mutation not only on the satisfaction
of the specification, but also on its vacuous satisfaction. Below we describe the adoption
of the other semantic coverage metrics.

FSM Coverage. In FSM coverage we are given an abstract FSM $F$ and we check the
influence of mutations and omissions in this FSM on the result of model checking of the
specification $\varphi$ in the design. In state coverage, for a state $w$ of $F$ we check the influence
of omission of $w$ or changing the values of output variables in $w$ on the (non-vacuous)
satisfaction of the specification in the design. Clearly, a mutant FSM $\tilde{F}_w$ obtained from
$F$ by omitting $w$ has fewer behaviors than $F$, thus for omissions of a state we only check
vacuity coverage. On the other hand, a mutant FSM $\tilde{F}_{w,o}$ obtained from $F$ by flipping
the value of the output variable $o \in O$ in $w$ can also falsify the specification, thus we
check falsity and vacuity coverage.

In path coverage, we check the influence of omitting or mutating a finite path on
the (non-vacuous) satisfaction of the specification in the design. A path $\pi$ of length $c$
in $F$ is a sequence of states $w_1, \ldots, w_c$ of $F$ such that for all $1 \le i \le c - 1$ we have
$\theta_{next}(w_i, w_{i+1}')$. Let us first define coverage for omissions of a path. A path $\pi$ is covered
by $\varphi$ if the mutant FSM $\tilde{F}_{\pi}$ obtained from $F$ by omitting all behaviors that contain $\pi$
satisfies $\varphi$ vacuously. On the other hand, we can also introduce mutations that replace
$\pi$ with a mutant path $\tilde{\pi}$ in the FSM. Then, the mutant FSM $\tilde{F}_{\pi,\tilde{\pi}}$ is obtained from $F$
by replacing $\pi$ with $\tilde{\pi}$. The mutant FSM $\tilde{F}_{\pi,\tilde{\pi}}$ can falsify $\varphi$ or can satisfy $\varphi$ vacuously,
thus for mutations that replace a path with another, mutant, path we check both falsity
and vacuity coverage. We note that all possible mutations in the FSM can be introduced
consistently on each occurrence of the mutated element, on exactly one occurrence, or on
a subset of occurrences, thus resulting in structure, node, or tree coverage, respectively.

Assertion Coverage. An input to assertion-coverage check is an FSM $F$, a specification
$\varphi$ that is satisfied non-vacuously in $F$, and a list of LTL assertions $a_1, \ldots, a_k$. An
assertion $a_i$ is covered by $\varphi$ in $F$ if the mutant FSM $\tilde{F}_{a_i}$ obtained from $F$ by omitting
all behaviors that do not satisfy $a_i$ satisfies $\varphi$ vacuously. We note that this definition is
similar to the definition of FSM path coverage. The only difference is in the description
of the mutation: in FSM path coverage we omit behaviors that contain a given finite path
$\pi$, whereas in assertion coverage we omit behaviors that do not satisfy a given assertion.



5 Coverage Computation 

In Section 4 we described new coverage metrics for model checking. In this section
we discuss how to compute these metrics. We first show that both vacuity and falsity
coverage can be reduced to model checking (possibly of mutant specifications and/or
mutant designs). Let $F$ be an FSM, $\varphi$ a specification that is satisfied in $F$ non-vacuously,
and $\tilde{F}$ a mutant FSM. If $\tilde{F}$ does not satisfy $\varphi$, we say that $\tilde{F}$ is falsity covered by $\varphi$. If
$\tilde{F}$ satisfies $\varphi$, it still may be vacuity covered by $\varphi$ if it satisfies $\varphi$ vacuously. Formally,
$\tilde{F}$ satisfies $\varphi$ vacuously if $\tilde{F} \models \varphi$ and there exists $\psi \in cl(\varphi)$ such that $\tilde{F}$ satisfies
$\varphi[\psi \leftarrow \bot]$. Thus, like falsity coverage, we check whether a mutant design $\tilde{F}$ satisfies a
specification, only that here the specification is also mutated.
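
The reduction of both checks to model checking can be sketched for the special case of an invariant specification over an explicit FSM, where model checking is a pass over the reachable states. The FSM, the specification G(req -> ack), the hand-written mutant specifications and all function names are hypothetical; a real implementation would call a model checker instead of `holds`.

```python
def reachable(init, succ):
    """All states reachable from the initial states (explicit search)."""
    seen, stack = set(init), list(init)
    while stack:
        for v in succ[stack.pop()]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def holds(init, succ, label, pred):
    """Check the invariant G pred on all reachable states."""
    return all(pred(label[s]) for s in reachable(init, succ))


def classify(init, succ, label, spec, bottoms):
    """Return 'falsity', 'vacuity', or None for one (mutant) FSM."""
    if not holds(init, succ, label, spec):
        return "falsity"                 # the mutant falsifies the spec
    if any(holds(init, succ, label, m) for m in bottoms):
        return "vacuity"                 # some mutated spec still holds
    return None


def omit_state(succ, w):
    """Mutant transition relation in which state w is removed."""
    return {u: [v for v in vs if v != w] for u, vs in succ.items() if u != w}


succ = {0: [0, 1], 1: [0]}
label = {0: set(), 1: {"req", "ack"}}
spec = lambda l: ("req" not in l) or ("ack" in l)           # G(req -> ack)
bottoms = [lambda l: "req" not in l, lambda l: "ack" in l]  # ack<-false, req<-true

original = classify([0], succ, label, spec, bottoms)
omitted = classify([0], omit_state(succ, 1), label, spec, bottoms)
flipped = classify([0], succ, {**label, 1: {"req"}}, spec, bottoms)
```

The original FSM satisfies the specification non-vacuously. Omitting state 1 leaves a design with fewer behaviors that satisfies the specification only vacuously, so the state is vacuity covered, while flipping `ack` in state 1 falsifies the specification, so the state is also falsity covered.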

Mutation Coverage. The algorithm we present for falsity-coverage computation is based
on the coverage algorithm described in [CKKV01]. That algorithm computes symbolically
falsity coverage for mutations that flip the value of a variable $q \in O$ in one state $w$
of the FSM. The idea is to look for a fair path in the product of the mutant FSM $\tilde{F}$ and an
automaton $A_{\neg\varphi}$ for the negation of $\varphi$. The state space of the product is $2^X \times S$, where
$X$ is the set of state variables of $F$, $S$ is the state space of $A_{\neg\varphi}$, and the transitions of the
product are induced by the transition relations of $F$ and $A_{\neg\varphi}$. In order to compute the
set of covered states, it is suggested in [CKKV01] to add $|X|$ new variables that encode
the state $w$ in which the value of $q$ is flipped. It is now possible to define symbolically
an augmented product, with state space $2^X \times 2^X \times S$, where the first component of a
state $\langle w, u, s \rangle$ is the state $w$ that is being considered, and the two other components are
as in the usual product automaton. The value of the first component is chosen nondeterministically
at initialization and is kept unchanged. The copy of the augmented product
with first component $w$ checks whether the mutation of $F$ in which $q$ is flipped in $w$
contains a fair path (in which case flipping $q$ in $w$ violates the specification). Thus, when
the augmented product is in a state $\langle w, w, s \rangle$, the set of successor states contains all
triples $\langle w, u, t \rangle$ such that $u$ is a successor of $w$ and $t \in \delta(s, \sigma)$, where $\sigma$ is the label of
$w$ in the mutant FSM. The above describes structure coverage, where the value of $q$ in $w$ is flipped in
all visits. Likewise, we can define an augmented product in which the value of $q$ in $w$ is
flipped only one time (node coverage) or some of the times (tree coverage). We can now
use a symbolic algorithm in order to find the set $P$ of all triples $\langle w, u, s \rangle$ from which
there exists a fair path in the augmented product automaton. The covered states are those
$w$ such that $\langle w, u_0, s_0 \rangle \in P$, for some initial states $u_0$ of $F$ and $s_0$ of $A_{\neg\varphi}$.



Vacuity Coverage. Recall that checking whether a system satisfies a specification vacuously
involves model checking of a mutant specification. We adjust the symbolic algorithm
in [CKKV01] to this setting by adding a new variable $x$ that encodes the subformula
$\psi \in cl(\varphi)$ that is being replaced with $\bot$. The variable $x$ is an integer in the range
$0, \ldots, |cl(\varphi)|$, thus it can be encoded with $O(\log |\varphi|)$ Boolean variables. The value 0 of
$x$ stands for "no replacement", thus it checks the satisfaction of $\varphi$ in the system. As with
mutations, the values of these variables are chosen nondeterministically at initialization
and are kept unchanged. In the automaton $A_{\neg\varphi}$, each state variable corresponds to a
subformula (cf. [BCM+92]), thus the nondeterministic choice of the subformula leads to
a mutant automaton $A_{\neg\varphi[\psi \leftarrow \bot]}$. The state space of the augmented product now consists
of triples $\langle x, u, s \rangle$, where $x$ encodes the subformula replaced with $\bot$, and $u$ and $s$ are
the components of the product automaton. The successors of $\langle x, u, s \rangle$ are the triples
$\langle x, u', s' \rangle$ such that $\langle u', s' \rangle$ is a possible successor of $\langle u, s \rangle$ in a product of the
system with the automaton $A_{\neg\varphi[\psi \leftarrow \bot]}$, where $\psi$ is the subformula encoded by $x$. The
subformulas that affect the value of $\varphi$ in the system are those encoded by a value $x$ for
which there are initial states $u_0$ and $s_0$ of the system and the automaton, respectively,
such that there is a fair path from $\langle x, u_0, s_0 \rangle$. Let $P$ be the set of triples from which a fair
path exists in the augmented product (as above, $P$ can be found symbolically), and let $P'$
be the intersection of $P$ with the initial states of the system and the automaton, projected
on the first element. Note that $x \in P'$ iff the subformula associated with $x$ affects the
value of $\varphi$ in the system. Thus, $\varphi$ is satisfied vacuously in the system if $\neg P'(0)$ and
$\neg P'(x)$ for some $x > 0$.

In order to get a symbolic algorithm for vacuity coverage, we combine the above
algorithm with the one of [CKKV01]. For example, if we want to find the set of states $w$
such that flipping the value of $q$ in $w$ causes the specification to be satisfied vacuously,
we augment the state space of the product of $F$ and $A_{\neg\varphi}$ by variables that encode both
the state in which we do the mutation and the subformula that is being replaced with $\bot$.
As we specify below, if we want to check vacuity coverage for other types of mutations,
we use the variables in order to encode the other types of mutations.
Code Coverage. Recall that in code coverage we need to check whether the omission
of parts of the code causes the specification to be satisfied vacuously. Accordingly, for
code coverage, it is simpler to define the mutations with respect to the HDL code. Let
$k$ be the number of elements in the code we want to check (e.g., the number of lines).
We introduce a new variable $mut$, which is an integer in the interval $[1, \ldots, k]$. The
value $i$ of $mut$ indicates that the mutation is in element $l_i$, which we want to omit, and
we need $O(\log k)$ Boolean variables to encode it. The HDL code is instrumented using
source-to-source translation (see [BKM02] as an example of such instrumentation)
so that $l_i$ in the code is replaced by the statement "if ($mut \neq i$) then $l_i$ else skip". The
instrumented code represents all the mutant designs^1. The product of the FSM induced
by the instrumented code and $A_{\neg\varphi}$ subsumes all the mutations of the code. It is now
possible to apply the symbolic algorithm described above (instead of the variables that
encode $w$, we now have the variables that encode $mut$) for detecting the mutations that
lead to vacuous satisfaction.
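
The instrumentation can be illustrated with a minimal Python sketch. It makes simplifying assumptions that are not in the text: the "code" is a list of hypothetical statement functions acting on a dictionary rather than HDL, and the names `run_instrumented`, `set_x`, `set_y` and `set_z` are illustrative only. It shows how one value of `mut` selects the omitted element, with `mut = 0` standing for the unmutated design.

```python
from typing import Callable, Dict, List

State = Dict[str, int]
Statement = Callable[[State], None]


def run_instrumented(code: List[Statement], mut: int, state: State) -> State:
    """Execute l_1..l_k where each l_i is wrapped as 'if mut != i then l_i else skip'."""
    for i, stmt in enumerate(code, start=1):
        if mut != i:          # the guard added by the instrumentation
            stmt(state)
        # else: skip (element l_i is omitted in this mutant)
    return state


# Three hypothetical code elements l_1, l_2, l_3.
def set_x(s: State) -> None:
    s["x"] = 1


def set_y(s: State) -> None:
    s["y"] = s.get("x", 0) + 1


def set_z(s: State) -> None:
    s["z"] = 5


CODE = [set_x, set_y, set_z]

# mut is chosen once and kept fixed for the whole run; iterating over its
# range here plays the role of the nondeterministic initial choice.
ALL_RUNS = {mut: run_instrumented(CODE, mut, {}) for mut in range(len(CODE) + 1)}
```

In the symbolic setting the loop over `mut` is not executed explicitly: `mut` is an extra state variable, so a single fixpoint computation covers all mutants at once.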

In expression coverage, we do something similar. Let $e_1, \ldots, e_m$ be the expressions
we want to check, and let $V_i = \{v^i_1, \ldots, v^i_{n_i}\}$ be the Boolean variables over which $e_i$
is defined. Assume that $n$ bounds the number of variables in every expression. Let "if
$e_i$ then $B_i$" be the statement that contains $e_i$ as a guard (handling of "while" or "until"
statements is similar). Recall that we want to check, for each $e_i$ and for each satisfying
assignment $f \in 2^{V_i}$, whether skipping $B_i$ when the variables have value $f$ causes the
specification to be satisfied vacuously. Accordingly, we add a variable $mut$ (encoded by
$O(\log m)$ Boolean variables) that indicates the expression to be checked, and $n$ variables
$u_1, \ldots, u_n$ that encode assignments to $n$ variables. As usual, the variables get their value
nondeterministically at initialization. The HDL code is now instrumented so that "if $e_i$
then $B_i$" in the code is replaced by "if ($mut \neq i$) or ($e_i \wedge \bigvee_{1 \le j \le n} v^i_j \neq u_j$) then $B_i$ else
skip". It is now possible to apply the symbolic algorithm described above for detecting
the expressions and assignments that lead to vacuous satisfaction.

Circuit Coverage. In latch coverage, we restrict the product of $F$ and $A_{\neg\varphi}$ to paths in
which the value of a latch is not allowed to change, and check whether this causes vacuous
satisfaction. Thus, we augment the product with variables that encode the examined latch
and (for the vacuity check) the subformula of $\varphi$ that we replace with $\bot$.

FSM Coverage. State and transition coverage can be computed using the techniques of
mutation-based metrics. We now describe the computation of path coverage. We start
with mutations that omit all behaviors that contain a given finite path $\pi = w_1, \ldots, w_c$.
Let $M_\pi$ be a monitor that filters away paths that contain $\pi$ as a sub-path. That is, $M_\pi$ is
a fair FSM that accepts paths $\rho$ such that $\pi$ is not a sub-path of $\rho$. Since $M_\pi$ only cares
for the values of control variables that encode the states (and not, for example, for the
values of output variables in these states), the set of input variables of $M_\pi$ is the set of
control variables $X$ of $F$, and $M_\pi$ does not have output variables. For a given path $\pi$,
the mutant FSM $\tilde{F}_\pi$ is the product FSM $F \times M_\pi$, which contains only the computations
of $F$ that do not have $\pi$ as a sub-path. Then, $\pi$ is vacuity covered by $\varphi$ if $\tilde{F}_\pi$ satisfies
$\varphi$ vacuously. For a set of paths $\{\pi_1, \ldots, \pi_k\}$, we can compute the set of covered paths
symbolically using the techniques described above for vacuity coverage.

^1 The user may wish to include 0 (no mutation) in the range of $mut$, in which case the instrumented
code also represents the original design.

In a similar way we can define mutations of paths that replace a finite path $\pi$ with a
path $\pi'$ of the same length, redirecting the system to another execution. If a mutated path
is of length 1, the mutation redirects one transition. For a path $\pi$ replaced with a mutant
path $\pi'$, we use a monitor $M_{\pi,\pi'}$. In the product of $F$ with $M_{\pi,\pi'}$, all the occurrences of
$\pi$ are replaced by $\pi'$. Note that for mutated (rather than omitted) paths we can compute
both falsity and vacuity coverage.

Assertion Coverage. For an LTL assertion $a$, a monitor for $a$ is the automaton $A_a$.
Given assertions $a_1, \ldots, a_k$, the mutant FSM is the product $F \times A_{a_1} \times \ldots \times A_{a_k}$.
Falsity and vacuity coverage of a set of assertions is computed similarly to FSM path
coverage, where the variable $mut$ encodes the assertion $a_{mut}$ for $1 \le mut \le k$.
Abstract. The standard technique for LTL model checking ($M \models \neg\varphi$) consists
of translating the negation of the LTL specification, $\varphi$, into a Büchi automaton
$A_\varphi$, and then checking whether the product $M \times A_\varphi$ has an empty language. The
efforts to maximize the efficiency of this process have so far concentrated on
developing translation algorithms producing Büchi automata which are "as small
as possible", under the implicit conjecture that this fact should make the final
product smaller. In this paper we build on a different conjecture and present an
alternative approach in which we generate instead Büchi automata which are "as
deterministic as possible", in the sense that we try to reduce as much as we are
able to the presence of non-deterministic decision states in $A_\varphi$. We motivate our
choice and present some empirical tests to support this approach.



1 Introduction 

Model checking is a formal verification technique which allows for checking whether the
model of a system satisfies some desired property. In LTL model checking, the system
is modeled as a Kripke structure $M$, and (the negation of) the property is encoded as an
LTL formula $\varphi$. The standard technique for LTL model checking consists of translating
$\varphi$ into a Büchi automaton $A_\varphi$, and then checking whether the product $M \times A_\varphi$ has an empty
language. To this extent, the quality of the translation technique plays a key role in the
efficiency of the overall process.

Since the seminal work in [6], the efforts to maximize the efficiency of this process
have so far concentrated on developing translation algorithms which produce from each
LTL formula a Büchi automaton (BA henceforth) which is "as small as possible" (see,
e.g., [1,12,3,5,4,9,7]). This is motivated by the implicit heuristic conjecture that, as the
size of the product $M \times A_\varphi$ of the Kripke structure $M$ and the BA $A_\varphi$ is in the worst case
the product of the sizes of $M$ and $A_\varphi$, reducing the size of $A_\varphi$ is likely to reduce the size
of the final product also in the average case. This conjecture is implicitly assumed in
most papers (e.g., [1,12,5,7]), which use the size of the BA's as the only measurement
of efficiency in empirical tests.

Remarkably, Etessami and Holzmann [3] tested their translation procedures by measuring
both the size of the resulting BA's and the actual efficiency of the LTL model checking
process, and noticed that "a smaller number of states in the automaton does not necessarily
improve the running time and can actually hurt it in ways that are difficult to
predict" [3].

HPRN-CT-2000-00102, and has thus benefited of the financial contribution of the Commission

In this paper we propose and explore a new research direction. Instead of wondering
what makes the BA smaller, we wonder directly what may make the product
automaton $M \times A_\varphi$ smaller, independently of the size of the BA $A_\varphi$. We start from
noticing the following fact: if a state $s$ in $M \times A_\varphi$ is given by the combination of the
states $s'$ in $M$ and $s''$ in $A_\varphi$, and if $s''$ is a deterministic decision state (that is, each
label may match with at most one successor of $s''$), then $s$ has at most the same
number of successor states as $s'$, no matter the number of successors of $s''$. From this
fact, we conjecture that reducing the presence of non-deterministic decision states in
the BA is likely to reduce the size of the final product in the average case, no matter if
this produces bigger BA's. (Notice that it is not always possible to remove completely
the non-deterministic decision states, as not every LTL formula $\varphi$ can be
translated into a deterministic BA, and even deciding whether the translation is possible
belongs to EXPSPACE and is PSPACE-hard [11].)

In order to explore the effectiveness of the above conjecture, we thus present a new 
approach in which we generate from each LTL formula a BA which is “as deterministic 
as possible”, in the sense that we try to reduce as much as we are able to the presence of 
non-deterministic decision states in the generated automaton. This is done by exploiting 
the idea of semantic branching, which has proved very effective in the domain of modal 
theorem proving [8]. 

The rest of the paper is structured as follows. In Section 2 we present some preliminary
notions. In Section 3 we describe the main ideas of our approach. In Section 4
we describe the LTL to BA algorithm we have implemented. In Section 5 we
present the results of an extensive empirical test. In Section 6 we conclude, describing
also some future work. For lack of space, the correctness and completeness of the
algorithm are proved in an extended technical report, which is available at
http://www.science.unitn.it/~stonetta/modella.html



2 Preliminaries 

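We use Linear Temporal Logic (LTL) with its standard syntax and semantics [2] to
specify properties. Let $\Sigma$ be a set of elementary propositions. A propositional literal
(i.e., a proposition $p$ in $\Sigma$ or its negation $\neg p$) is an LTL formula; if $\varphi_1$ and $\varphi_2$ are LTL
formulae, then $\neg\varphi_1$, $\varphi_1 \wedge \varphi_2$, $\varphi_1 \vee \varphi_2$, $\mathbf{X}\varphi_1$, $\varphi_1 \mathbf{U} \varphi_2$, $\varphi_1 \mathbf{R} \varphi_2$ are LTL formulae, where $\mathbf{X}$,
$\mathbf{U}$ and $\mathbf{R}$ are the standard "next", "until" and "releases" temporal operators respectively.
We see the familiar $\top$ (true), $\bot$ (false), $\mathbf{F}\varphi_1$ (eventually $\varphi_1$) and $\mathbf{G}\varphi_1$ (globally $\varphi_1$) as
standard abbreviations of $p \vee \neg p$, $p \wedge \neg p$, $\top \mathbf{U} \varphi_1$ and $\bot \mathbf{R} \varphi_1$ respectively.

For every operator $op$ in $\{\wedge, \vee, \mathbf{X}, \mathbf{F}, \mathbf{G}, \mathbf{U}, \mathbf{R}\}$, we say that $\varphi$ is an $op$-formula if
$op$ is the root operator of $\varphi$ (e.g., $\mathbf{X}(p \mathbf{U} q)$ is an $\mathbf{X}$-formula). We say that the occurrence
of a subformula $\varphi_1$ in an LTL formula $\varphi$ is a top level occurrence if it occurs in the scope
of only boolean operators $\neg$, $\wedge$, $\vee$ (e.g., $\mathbf{F}p$ occurs at top level in $\mathbf{F}p \vee \mathbf{X}\mathbf{F}q$, while $\mathbf{F}q$
does not).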
A Kripke structure $M$ is a tuple $\langle S, S_0, T, \mathcal{L} \rangle$ with a finite set of states $S$, a set of
initial states $S_0 \subseteq S$, a transition relation $T \subseteq S \times S$ and a labeling function $\mathcal{L} : S \to 2^\Sigma$,
where $\Sigma$ is the set of atomic propositions.

A labeled generalized BA (LGBA) [6] is a tuple $A := \langle Q, Q_0, T, \mathcal{L}, D, \mathcal{F} \rangle$, where
$Q$ is a finite set of states, $Q_0 \subseteq Q$ is the set of initial states, $T \subseteq Q \times Q$ is the transition
relation, $D := 2^\Sigma$ is the finite domain (alphabet), $\mathcal{L} : Q \to 2^D$ is the labeling function,
and $\mathcal{F} \subseteq 2^Q$ is the set of accepting conditions (fair sets). A run of $A$ is an infinite
sequence $\sigma := \sigma(0), \sigma(1), \ldots$ of states in $Q$, such that $\sigma(0) \in Q_0$ and $T(\sigma(i), \sigma(i+1))$
holds for every $i \ge 0$. A run $\sigma$ is an accepting run if, for every $F_i \in \mathcal{F}$, there exists
$\sigma(j) \in F_i$ that appears infinitely often in $\sigma$. An LGBA $A$ accepts an infinite word
$\xi := \xi(0), \xi(1), \ldots \in D^\omega$ if there exists an accepting run $\sigma := \sigma(0), \sigma(1), \ldots$ so that
$\xi(i) \in \mathcal{L}(\sigma(i))$, for every $i \ge 0$. Henceforth, if not otherwise specified, we will refer to
an LGBA simply as a Büchi automaton (BA).

Notice that each state in a Kripke structure is labeled by one total truth assignment
to the propositions in $\Sigma$, whilst the label of a state in a BA represents a set of such
assignments. A partial assignment represents the set of all total assignments/labels which
entail it. We represent truth assignments indifferently as sets of literals $\{l_i\}_i$ or as
conjunctions of literals $\bigwedge_i l_i$, with the intended meaning that a literal $p$ (resp. $\neg p$) in the
set/conjunction assigns $p$ to true (resp. false).

Notationally, we use $\xi$ for representing an infinite word over $2^\Sigma$; $\xi(i)$ is the $i$-th
element and $\xi_i$ is the suffix starting from $\xi(i)$. We use $\sigma$ for an infinite sequence of
states (runs); $\sigma(i)$ is the $i$-th element and $\sigma_i$ is the suffix starting from $\sigma(i)$. We use $\mu$
for truth assignments. We use $\varphi$, $\psi$, $\vartheta$ for general formulae. We denote by $succ(s, A_\varphi)$
[$succ(s, M)$] the set of successor states of the state $s$ in a BA $A_\varphi$ [Kripke structure $M$].

If $\mu$ is a truth assignment and $\varphi$ is an LTL formula, we denote by $\varphi[\mu]$ the formula
obtained by substituting every top level literal $l \in \mu$ in $\varphi$ with $\top$ (resp. $\neg l$ with $\bot$)
and by propagating the $\top$ and $\bot$ values in the obvious ways. (E.g., $((p \vee \mathbf{X}\varphi_1) \wedge (q \vee
\mathbf{X}\varphi_2))[\{p, \neg q\}] = \mathbf{X}\varphi_2$.)

An elementary formula is an LTL formula which is either a constant in $\{\top, \bot\}$,
a propositional literal or an $\mathbf{X}$-formula. A cover for a set of LTL formulae $\{\varphi_k\}_k$ is a
set of sets of elementary formulae $\{\{\vartheta_{ij}\}_j\}_i$ s.t. $\bigwedge_k \varphi_k \leftrightarrow \bigvee_i \bigwedge_j \vartheta_{ij}$. (Henceforth,
we indifferently represent covers either as sets of sets or as disjunctions of conjunctions
of elementary formulae.) A cover for $\{\varphi_k\}_k$ is typically obtained by computing
the disjunctive normal form (DNF) of $\bigwedge_k \varphi_k$, considering $\mathbf{X}$-subformulae as boolean
propositions.

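The general translation schema of an LTL formula $\varphi$ into a BA works as follows
[6]. First, $\varphi$ is written in negative normal form (NNF), that is, all negations are pushed
down to literal level. Second, $\varphi$ is expanded by applying the tableau rewriting rules:

$$\varphi_1 \mathbf{U} \varphi_2 \Longrightarrow \varphi_2 \vee (\varphi_1 \wedge \mathbf{X}(\varphi_1 \mathbf{U} \varphi_2)), \qquad \varphi_1 \mathbf{R} \varphi_2 \Longrightarrow \varphi_2 \wedge (\varphi_1 \vee \mathbf{X}(\varphi_1 \mathbf{R} \varphi_2)) \tag{1}$$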

until no $\mathbf{U}$-formula or $\mathbf{R}$-formula occurs at top level. Then the resulting formula is
rewritten into a cover by computing its DNF. Each disjunct of the cover represents a
state of the automaton: all propositional literals represent the label of the state (that is,
the condition the input word must satisfy in that state) and the remaining $\mathbf{X}$-formulae
represent the next part of the state (that is, the obligations that must be fulfilled to get
an accepting run) and determine the transitions outgoing from the state.

[Figure 1 panels: Kripke structure $M$; Büchi automaton $A$; product $M \times A$]

Fig. 1. Product of a generic Kripke structure with a non-deterministic (up) and a deterministic
(down) cover expansion of $\varphi := (p \vee \mathbf{X}\varphi_1) \wedge (q \vee \mathbf{X}\varphi_2)$.

The process above is applied recursively to the next part of each state, until no
new obligation is produced. This results in a closed set of covers, so that, for each
cover $C$ in the set, the next part of each disjunct in $C$ has a cover in the set. Then
$A_\varphi = \langle Q, Q_0, T, \mathcal{L}, D, \mathcal{F} \rangle$ is built as follows. The initial states are given by the cover
of $\varphi$. The transition relation is given by connecting each state to those in the cover of its
next part. An acceptance condition $F_i$ is added for every elementary subformula in the
form $\psi \mathbf{U} \vartheta$, so that $F_i$ contains every state $s \in Q$ such that $s \not\models (\psi \mathbf{U} \vartheta)$ or $s \models \vartheta$.



3 A New Approach 

3.1 Deterministic and Non-deterministic Decision States 

We say that two states are mutually consistent if their respective labels are mutually 
consistent, mutually inconsistent otherwise. We say that a state s in a BA is a deterministic 
decision state if the labels of all successor states of s are pairwise mutually inconsistent, 
a non-deterministic decision state otherwise. Intuitively, if s is a deterministic decision 
state, then every label in the alphabet is consistent with (the label of) at most one successor 
of s. A BA is deterministic if its states are all deterministic decision states and if its initial 
states are pairwise mutually inconsistent. 

We consider an LTL model checking problem $M \models \neg\varphi$, where $M$ is a Kripke
structure and $\varphi$ is an LTL formula. $A_\varphi$ is the BA into which $\varphi$ is converted, and $M \times A_\varphi$
is the product of $M$ and $A_\varphi$. Each state $s$ in $M \times A_\varphi$ is given by the (consistent) pairwise
combination $s's''$ of some states $s'$ in $M$ and $s''$ in $A_\varphi$, and the successor states of $s$ are
given by all the consistent combinations of one successor of $s'$ and one of $s''$:

$$succ(s, M \times A_\varphi) = \{ s'_i s''_j \mid s'_i \in succ(s', M),\ s''_j \in succ(s'', A_\varphi),\ s'_i s''_j \neq \bot \}, \tag{2}$$
$$|succ(s, M \times A_\varphi)| \le |succ(s', M)| \cdot |succ(s'', A_\varphi)|, \tag{3}$$

where $s's''$ denotes the combination of the states $s'$ and $s''$ and "$s's'' \neq \bot$" denotes
the fact that the combination of $s'$ and $s''$ is consistent.

We make the following key observation: if $s''$ is a deterministic decision state, then
each successor state of $s'$ can combine consistently with at most one successor of $s''$, so
that $s$ has at most as many successor states as $s'$. Thus (3) reduces to

$$|succ(s, M \times A_\varphi)| \le |succ(s', M)|. \tag{4}$$

The above observation suggests the following heuristic consideration: in order to
minimize the size of the product $M \times A_\varphi$, we should try to make $A_\varphi$ "as deterministic
as we can" (that is, to reduce as much as we can the presence of non-deterministic
decision states in $A_\varphi$), no matter if the resulting BA is larger than other equivalent but
"less deterministic" BA's.

Example 1. Consider the state $s'$ of a Kripke structure $M$ in Figure 1 (left) and its
successor states $s'_1$, $s'_2$, $s'_3$ and $s'_4$ with labels $\{p, q, \ldots\}$, $\{p, \neg q, \ldots\}$, $\{\neg p, q, \ldots\}$ and
$\{\neg p, \neg q, \ldots\}$ respectively. Consider the LTL formula $\varphi := (p \vee \mathbf{X}\varphi_1) \wedge (q \vee \mathbf{X}\varphi_2)$ for
some LTL subformulae $\varphi_1$ and $\varphi_2$. Consider the two covers of $\varphi$:

$$C_1 := \{\{p, q\}, \{p, \mathbf{X}\varphi_2\}, \{q, \mathbf{X}\varphi_1\}, \{\mathbf{X}\varphi_1, \mathbf{X}\varphi_2\}\}, \tag{5}$$

$$C_2 := \{\{p, q\}, \{p, \neg q, \mathbf{X}\varphi_2\}, \{\neg p, q, \mathbf{X}\varphi_1\}, \{\neg p, \neg q, \mathbf{X}\varphi_1, \mathbf{X}\varphi_2\}\}, \tag{6}$$

which generate the two BA's $A$ in Figure 1 (center) respectively. In the first BA the state
$s''$ is a non-deterministic decision state. Thus the successors of $s's''$ in $M \times A$ are the
consistent states belonging to the cartesian product of the successor sets of $s'$ and $s''$.
In particular, $s'_1$ matches with all successor states of $s''$, $s'_2$ matches with $s''_2$ and $s''_4$, $s'_3$
matches with $s''_3$ and $s''_4$, and $s'_4$ matches with $s''_4$. In the second BA $s''$ is a deterministic
decision state. Thus, each successor of $s'$ matches with only one successor of $s''$. ◇
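
The counting argument behind bounds (3) and (4) can be replayed on this example with a short Python sketch. The encoding is an assumption made for illustration only: labels are sets of string literals with `"!p"` standing for the negation of `p`, only the propositional part of each cover element is kept, the elided propositions "..." in the Kripke labels are ignored, and the function names are hypothetical.

```python
from itertools import product


def negate(lit: str) -> str:
    """Return the complementary literal ('p' <-> '!p')."""
    return lit[1:] if lit.startswith("!") else "!" + lit


def consistent(label_a: set, label_b: set) -> bool:
    """Two labels are mutually consistent if no literal of one is negated in the other."""
    return not any(negate(lit) in label_b for lit in label_a)


def product_successors(kripke_succ, ba_succ):
    """Index pairs (i, j) such that s'_i and s''_j combine consistently, as in (2)."""
    pairs = product(enumerate(kripke_succ, 1), enumerate(ba_succ, 1))
    return [(i, j) for (i, la), (j, lb) in pairs if consistent(la, lb)]


# Labels of the successors s'_1..s'_4 of s' in M.
KRIPKE = [{"p", "q"}, {"p", "!q"}, {"!p", "q"}, {"!p", "!q"}]
# Propositional parts of the elements of C_1 (5) and C_2 (6).
C1_LABELS = [{"p", "q"}, {"p"}, {"q"}, set()]
C2_LABELS = [{"p", "q"}, {"p", "!q"}, {"!p", "q"}, {"!p", "!q"}]

NONDET = product_successors(KRIPKE, C1_LABELS)
DET = product_successors(KRIPKE, C2_LABELS)
print(len(NONDET), len(DET))
```

With the non-deterministic cover the product state has nine successors, while with the deterministic cover it has four, one per successor of $s'$.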

Remark 1. It is well known (see, e.g., [11]) that converting a non-deterministic BA $A$
into a deterministic one $A'$ (when possible) may make the size of the latter blow up
exponentially w.r.t. the size of the former in the worst case. This is due to the fact that
each state $s'$ of $A'$ represents a subset of states of $A$, so that $|A'| \le 2^{|A|}$ and
hence $|M \times A'| \le |M| \cdot 2^{|A|}$, whilst $|M \times A| \le |M| \cdot |A|$. Thus, despite the local effect
described above (4), one may suppose that our approach worsens the global
performance.

We notice instead that $\mathcal{L}(s') \models \bigwedge_i \mathcal{L}(s_i)$, where the $s_i$ are the states of $A$ represented by $s'$,
so that the set of states in $M$ matching with $s'$ is a subset of the intersection of the sets of
states in $M$ matching with each $s_i$:

$$\{s^* \in M \mid s^* s' \neq \bot\} \subseteq \bigcap_i \{s^* \in M \mid s^* s_i \neq \bot\}. \tag{7}$$

Thus, the process of determinization may increase the number of states in the BA, but
it also reduces the number of states in $M$ with which each state in the BA matches. □
Example 2. Consider the LTL formula and the covers of Example 1. (Notationally, we
denote by $C_{ij}$ the $j$-th element of $C_i$.) Then $C_{21}$, $C_{22}$, $C_{23}$ and $C_{24}$ each match with 1/4 of the
possible labels, whilst $C_{11}$, $C_{12}$, $C_{13}$ and $C_{14}$ match with 1/4, 1/2, 1/2 and 1/1 of the
possible labels respectively.

3.2 Deterministic and Non-deterministic Covers 

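Let $\{\varphi_k\}_k$ be a set of LTL formulae in NNF, let $\varphi$ denote $\bigwedge_k \varphi_k$, and let $C := \{\{\vartheta_{ij}\}_j\}_i$
be a cover for $\varphi$. $C$ can be written as $\{\mu_i \cup \chi_i\}_i$, where $\mu_i := \{\vartheta_{ij} \in \{\vartheta_{ij}\}_j \mid
\vartheta_{ij}$ prop. literal$\}$ and $\chi_i := \{\vartheta_{ij} \in \{\vartheta_{ij}\}_j \mid \vartheta_{ij}$ $\mathbf{X}$-formula$\}$ are the sets of propositional
literals and $\mathbf{X}$-formulae in $\{\vartheta_{ij}\}_j$ respectively. Thus

$$\varphi \leftrightarrow \bigvee_i (\mu_i \wedge \chi_i). \tag{8}$$

We say that a cover $C = \{\mu_i \cup \chi_i\}_i$ as in (8) is a deterministic cover if and only if all
$\mu_i$'s are pairwise mutually inconsistent, non-deterministic otherwise.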

Example 3. Consider the LTL formula and the covers of Example 1. $C_1$ is non-deterministic
because, e.g., $\{p, q\}$ and $\{p\}$ are mutually consistent. $C_2$ is deterministic because
$\{p, q\}$, $\{p, \neg q\}$, $\{\neg p, q\}$ and $\{\neg p, \neg q\}$ are pairwise mutually inconsistent. ◇

In the construction of a BA, each element $\mu_i \wedge \chi_i$ in a cover $C$ represents a state $s_i$, where
$\mu_i$ is the label of the state and $\chi_i$ is its next part (by abuse of notation, we henceforth call
such a formula "state"). Thus, a deterministic cover $C$ represents a set of states whose
labels are pairwise mutually inconsistent. Consequently, deterministic covers (when
admissible) give rise to deterministic decision states.

3.3 Computing Deterministic Covers 

As said in the previous sections, the standard approach for computing covers is based
on the recursive application of the tableau rules (1) and on the subsequent computation
of the DNF of the resulting formula. The latter step is achieved by applying recursively
to the top level formulae the rewriting rule

$$\varphi' \wedge (\varphi_1 \vee \varphi_2) \Longrightarrow (\varphi' \wedge \varphi_1) \vee (\varphi' \wedge \varphi_2) \tag{9}$$

and then by removing every disjunct which propositionally implies another one. As in [8],
we call step (9) syntactic branching because it splits "syntactically" on the disjuncts of
the top level $\vee$-subformulae. As noticed in [8], a major weakness of syntactic branching
is that it generates subbranches which are not mutually inconsistent, so that, even after
the removal of implicant disjuncts, the distinct disjuncts of the final DNF may share
models. As a consequence, if the boolean parts of two disjuncts in a cover are mutually
consistent, non-deterministic decision states are generated.

To avoid this fact we compute a cover in a new way. After applying the tableau rules,
we apply recursively to the top level boolean propositions the Shannon expansion

$$\varphi \Longrightarrow (p \wedge \varphi[\{p\}]) \vee (\neg p \wedge \varphi[\{\neg p\}]). \tag{10}$$

As in [8], we call step (10) semantic branching because it splits "semantically" on the
truth values of top level propositions. The key issue of semantic branching is that it
generates subbranches which are all mutually inconsistent [8]. Thus, after applying (10)
to all top level literals in $\varphi$, we obtain an expression in the form

$$\bigvee_i (\mu_i \wedge \varphi[\mu_i]), \tag{11}$$

such that all $\mu_i$'s are pairwise mutually inconsistent and $\varphi[\mu_i]$ is a boolean combination
of $\mathbf{X}$-formulae. If all $\varphi[\mu_i]$'s are conjunctions of $\mathbf{X}$-formulae, then (11) is in the form
(8), so that we have obtained a deterministic cover. If not, every disjunct $(\mu_i \wedge \varphi[\mu_i])$ in
(11) represents a set of states $S_i$ such that all states belonging to the same set $S_i$ have the
same label $\mu_i$ but different next parts, whilst any two states belonging to different sets
$S_i$ are mutually inconsistent.

As a consequence, the presence of non-unary sets $S_i$ is a potential source of non-determinism.
Thus, if this does not affect the correctness of the encoding (see below),
we rewrite each formula $\varphi[\mu_i]$ into a single $\mathbf{X}$-formula by applying the rewriting rules:

$$\mathbf{X}\varphi_1 \wedge \mathbf{X}\varphi_2 \Longrightarrow \mathbf{X}(\varphi_1 \wedge \varphi_2), \tag{12}$$

$$\mathbf{X}\varphi_1 \vee \mathbf{X}\varphi_2 \Longrightarrow \mathbf{X}(\varphi_1 \vee \varphi_2). \tag{13}$$

The result is clearly a deterministic cover. We call this step branching postponement
because (13) allows for postponing the or-branching to the expansion of the next part.

Example 4. Consider the LTL formula and the covers of Example 1. The cover $C_1$ is
obtained by applying syntactic branching to $\varphi$ from left to right, whilst $C_2$ is obtained by
applying semantic branching to $\varphi$, splitting on $p$ and $q$. (As all $\varphi[\mu_i]$'s are conjunctions
of $\mathbf{X}$-formulae, no further step is necessary.) ◇
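
The difference between the two branching styles can be reproduced with a small Python sketch. It is only an illustration under stated assumptions, not the MoDeLLA implementation: formulas are encoded as nested tuples (a hypothetical encoding), $\mathbf{X}$-formulae are opaque atoms, the semantic variant simply enumerates all assignments to the top level propositions, and neither variant removes redundant or inconsistent disjuncts.

```python
from itertools import product

# Hypothetical encoding: True / False, ("lit", name, polarity), ("X", name),
# ("and", f, g), ("or", f, g). X-formulae are treated as opaque atoms.


def props(f):
    """Names of the propositions occurring at top level in f."""
    if isinstance(f, bool) or f[0] == "X":
        return set()
    if f[0] == "lit":
        return {f[1]}
    return props(f[1]) | props(f[2])


def subst(f, mu):
    """phi[mu]: replace assigned top level literals by constants and propagate them."""
    if isinstance(f, bool) or f[0] == "X":
        return f
    if f[0] == "lit":
        return (mu[f[1]] == f[2]) if f[1] in mu else f
    a, b = subst(f[1], mu), subst(f[2], mu)
    absorbing = f[0] == "or"          # True absorbs "or", False absorbs "and"
    if a is absorbing or b is absorbing:
        return absorbing
    if isinstance(a, bool):           # a is the neutral constant
        return b
    if isinstance(b, bool):
        return a
    return (f[0], a, b)


def dnf(f):
    """Syntactic branching: distribute "and" over "or" as in rule (9)."""
    if f is True:
        return [frozenset()]
    if f is False:
        return []
    if f[0] in ("lit", "X"):
        return [frozenset([f])]
    left, right = dnf(f[1]), dnf(f[2])
    if f[0] == "or":
        return left + right
    return [a | b for a in left for b in right]


def semantic_cover(f):
    """Semantic branching: split on truth values of top level propositions (10)."""
    names = sorted(props(f))
    cover = []
    for values in product([True, False], repeat=len(names)):
        mu = dict(zip(names, values))
        label = frozenset(("lit", n, v) for n, v in mu.items())
        cover += [label | nxt for nxt in dnf(subst(f, mu))]
    return cover


def is_deterministic(cover):
    """True if the propositional labels are pairwise mutually inconsistent."""
    def label(state):
        return {a for a in state if a[0] == "lit"}

    def consistent(s, t):
        return not any(("lit", n, not v) in label(t) for (_, n, v) in label(s))

    return all(not consistent(s, t)
               for i, s in enumerate(cover) for t in cover[i + 1:])


# phi := (p or X phi1) and (q or X phi2), as in Example 1.
PHI = ("and", ("or", ("lit", "p", True), ("X", "phi1")),
              ("or", ("lit", "q", True), ("X", "phi2")))
C1 = dnf(PHI)
C2 = semantic_cover(PHI)
```

On this input `C1` and `C2` correspond to the covers (5) and (6): both have four elements, but only the labels of `C2` are pairwise mutually inconsistent.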

Unfortunately, branching postponement is not always safely applicable. In fact, while
rule (12) can always be applied without affecting the correctness of the encoding, this
is not the case for rule (13). For example, it may be the case that $\mathbf{X}\varphi_1$ and $\mathbf{X}\varphi_2$ in (13)
represent two states $s_1$ and $s_2$ respectively so that $s_1$ is in a fair set $F_i$ and $s_2$ is not,
and that the state corresponding to $\mathbf{X}(\varphi_1 \vee \varphi_2)$ is not in $F_i$; if so, we may lose the
fairness condition $F_i$ if we apply (13). This fact should not be a surprise: if branching
postponement were always applicable, then we could always generate a deterministic
BA from an LTL formula, which is not the case [11]. Our idea is thus to apply branching
postponement only to those formulae $\varphi[\mu_i]$ for which we are guaranteed it does not cause
incorrectness, and to apply standard DNF otherwise. This will be described in detail in
the next section.

To sum up, semantic branching allows for partitioning the next states into mutually
inconsistent sets of states $S_i$, whilst branching postponement, when applied, collapses
each $S_i$ into only one state. Notice that

- unlike syntactic branching, semantic branching guarantees that the only possible
  sources of non-determinism (if any) are due to the next-part components $\varphi[\mu_i]$.
  No source of non-determinism is introduced by the boolean components $\mu_i$;
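cover compute_cover($\varphi$) {
 1   apply_tableau_rules($\varphi$);
 2   for each $p$ occurring at top level in $\varphi$ {
 3     $\varphi := (p \wedge \varphi[\{p\}]) \vee (\neg p \wedge \varphi[\{\neg p\}])$;                      // semantic branching on labels
 4     simplify($\varphi$);                                                                // boolean simplification
 5   }                                                                                 // now $\varphi = \bigvee_{i \in I} (\mu_i \wedge \varphi[\mu_i])$
 6   $\varphi := \bigvee_{i \in I} (\mu_i \wedge DNF(\varphi[\mu_i]))$;                                  // now $\varphi = \bigvee_{i \in I} (\mu_i \wedge \bigvee_{j \in J_i} \bigwedge_{k \in K_{ij}} \mathbf{X}\psi_{ijk})$
 7   $\varphi := \bigvee_{i \in I} (\mu_i \wedge \bigvee_{j \in J_i} \mathbf{X} \bigwedge_{k \in K_{ij}} \psi_{ijk})$;      // factoring out the $\mathbf{X}$ operators
 8   $C^*(\varphi) := \bigvee_{i \in I, j \in J_i} (\mu_i \wedge \mathbf{X}\psi_{ij})$;                         // $\psi_{ij}$ being $\bigwedge_{k \in K_{ij}} \psi_{ijk}$
 9   $C(\varphi) := \bot$;                                                                  // initialization of $C(\varphi)$
10   for each $i \in I$ {
11     $s_i := (\mu_i \wedge \mathbf{X} \bigvee_{j \in J_i} \psi_{ij})$;
12     $Subs(s_i) := \bigvee_{j \in J_i} (\mu_i \wedge \mathbf{X}\psi_{ij})$;
13     if (Postponement_Is_Safe($s_i$))
14       then $C(\varphi) := C(\varphi) \vee s_i$;                                              // postponement applied
15       else $C(\varphi) := C(\varphi) \vee Subs(s_i)$;                                        // postponement not applied
16   }
17   return $C(\varphi)$; }

Fig. 2. The schema of the cover computation algorithm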



- branching postponement reduces the number of states sharing the same labels even
  if it is applied only to a strict subset of the subformulae $\varphi[\mu_i]$ in (11). Thus, also
  partial applications of branching postponement make the BA "more deterministic".



4 The MoDeLLA Algorithm 

In the current state-of-the-art algorithms the translation from an LTL formula $\varphi$ into a
BA $A_\varphi$ can be divided into three main phases:

1. Formula rewriting: apply a finite set of rewriting rules to $\varphi$ in order to remove
redundancies and make it more suitable for an efficient translation.

2. BA construction from $\varphi$: build a BA with the same language as the input formula $\varphi$.

3. BA reduction: reduce redundancies in the BA (e.g., by exploiting simulations).

In our work, we focus on phase 2. According to the new approach proposed in 
the previous section, we have conceived and implemented a new translation algorithm, 
called MoDeLLA (More Deterministic LTL to Automata) which builds a BA from an 
LTL formula trying to apply branching postponement as often as it is able to. 

4.1 The Basic Algorithm 

The general schema of the BA construction in MoDeLLA, in its basic form, is the 
standard one proposed in [6] and briefly recalled in Section 2. MoDeLLA differs from 
previous conversion algorithms in two steps: the computation of the covers and the 
computation of the fair sets. 
Computation of the cover. The function which computes the cover of the formula $\varphi$
is described in Figure 2. First, we apply, as usual, the tableau rewriting rules (1) (line
1). The formula obtained is a boolean combination of literals and $\mathbf{X}$-formulae. After
applying the semantic branching rules on labels (10), we get a disjunction of formulae
in the form (11) (lines 2-5).

If now we applied branching postponement (12) and (13), denoting $\bigwedge_{k \in K_{ij}} \psi_{ijk}$ by
$\psi_{ij}$, we would obtain the deterministic cover:

$$C^{D}(\varphi) := \{\mu_i \wedge \mathbf{X} \bigvee_{j \in J_i} \psi_{ij}\}_{i \in I}. \tag{14}$$

Unfortunately, as pointed out in Section 3.3, branching postponement may affect the
correctness of the BA. Thus, we apply it only in "safe" cases. First, for every disjunct
$\mu_i \wedge \varphi[\mu_i]$ we temporarily compute $DNF(\varphi[\mu_i])$ and then we factor $\mathbf{X}$ out of every
conjunction in $DNF(\varphi[\mu_i])$ (lines 6-7). We obtain a temporary non-deterministic cover

$$C^*(\varphi) := \{\mu_i \wedge \mathbf{X}\psi_{ij}\}_{i \in I, j \in J_i}. \tag{15}$$

Notice that every state $s_i$ in $C^{D}(\varphi)$ is equivalent to the disjunction of $|J_i|$ states in $C^*(\varphi)$:

$$s_i = \mu_i \wedge \mathbf{X} \bigvee_{j \in J_i} \psi_{ij} \equiv \bigvee_{j \in J_i} (\mu_i \wedge \mathbf{X}\psi_{ij}). \tag{16}$$

For every $i \in I$, we define the set of substates of $s_i$ as:

$$Subs(s_i) := \{\mu_i \wedge \mathbf{X}\psi_{ij}\}_{j \in J_i}. \tag{17}$$

($Subs(s_i)$ is the set $S_i$ in Section 3.3.) We extend the definition to every state $s^*$ of
$C^*(\varphi)$ by saying that $Subs(s^*) := \{s^*\}$.

Then, the cover $C(\varphi)$ is built in the following way (lines 10-16): for every $i \in I$,
we add to $C(\varphi)$ $s_i$ if postponement is safe for $s_i$, $Subs(s_i)$ otherwise.
Postponement_Is_Safe($s$) decides if branching postponement is safe for a state $s$ according to
a sufficient condition described in the following paragraphs.

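Computation of fair sets. If $\mathcal{U}_\varphi$ is the set of $\mathbf{U}$-formulae which are subformulae of $\varphi$,
the usual set of accepting conditions is:

$$\mathcal{F}^* := \{F^*_{\psi \mathbf{U} \vartheta} \mid \psi \mathbf{U} \vartheta \in \mathcal{U}_\varphi\}, \tag{18}$$

$$F^*_{\psi \mathbf{U} \vartheta} := \{s \in Q \mid s \not\models \psi \mathbf{U} \vartheta \text{ or } s \models \vartheta\}. \tag{19}$$

We extend these definitions as follows:

$$\mathcal{F} := \{F_{\mathcal{H}} \mid \mathcal{H} \in 2^{\mathcal{U}_\varphi}\}, \tag{20}$$

$$F_{\mathcal{H}} := \{s \in Q \mid \text{there exists } \psi \mathbf{U} \vartheta \in \mathcal{H} \text{ s.t. for each } s^* \in Subs(s),\ s^* \not\models \psi \mathbf{U} \vartheta \text{ or for each } s^* \in Subs(s),\ s^* \models \vartheta\}. \tag{21}$$

Notice that, if $|\mathcal{H}| = 1$ and, for every $s \in Q$, $|Subs(s)| = 1$ (i.e., we have never applied
branching postponement), this is the usual notion (i.e., $F_{\{\psi \mathbf{U} \vartheta\}} = F^*_{\psi \mathbf{U} \vartheta}$).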
We say that branching postponement is not safe for a state $s$ if there exists $F_{\mathcal{H}} \in \mathcal{F}$
such that $s \notin F_{\mathcal{H}}$ and there exist $\psi \mathbf{U} \vartheta \in \mathcal{H}$, $s^* \in Subs(s)$ such that $s^* \in F_{\{\psi \mathbf{U} \vartheta\}}$.

With this condition we are guaranteed that if the BA $A^*$ built without branching
postponement has an accepting run $\sigma^*$ over a word $\xi$, then the corresponding run $\sigma$ of
the BA built with safe branching postponement is also accepting.

Example 5. Consider the LTL formula $\varphi := FGp$. After having applied the tableau rules
and semantic branching on labels, we obtain $\varphi = (p \wedge (XFGp \vee XGp)) \vee (\neg p \wedge XFGp)$.
If $s = (p \wedge X(FGp \vee Gp))$, the branching postponement is not safe for $s$. Indeed,
$Subs(s) = \{(p \wedge XFGp), (p \wedge XGp)\}$ and $(p \wedge XGp) \in F_{\{FGp\}}$ but $s \notin F_{\{FGp\}}$. Thus,
compute-Cover produces the cover:

$$\{(p \wedge XFGp), (p \wedge XGp), (\neg p \wedge XFGp)\}. \quad \diamond \qquad (22)$$
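
The fair-set membership test of (21) and the safety condition above can be sketched in Python as follows. This is only an illustrative sketch, not MoDeLLA's C implementation. The `Substate` record and its precomputed fields are assumptions made to keep the example self-contained: `sat_until` holds the U-formulae the substate entails, and `sat_goal` holds the U-formulae whose right-hand side it entails. The data at the end encodes Example 5, where FGp is read as the single U-formula (true U Gp).

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Substate:
    name: str
    sat_until: frozenset  # U-formulae (psi U theta) entailed by the substate
    sat_goal: frozenset   # U-formulae whose right-hand side theta is entailed


def in_fair_set(subs, H):
    """Membership in F_H of a state whose substates are `subs`, following (21)."""
    return any(
        all(u not in s.sat_until for s in subs) or all(u in s.sat_goal for s in subs)
        for u in H
    )


def nonempty_subsets(U):
    """Yield every non-empty subset H of the set of U-formulae."""
    items = sorted(U)
    for r in range(1, len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)


def postponement_is_safe(subs, U):
    """Unsafe if some F_H misses the state while one substate, taken alone,
    belongs to F_{u} for some u in H."""
    for H in nonempty_subsets(U):
        if in_fair_set(subs, H):
            continue
        for u in H:
            if any(in_fair_set([s], {u}) for s in subs):
                return False
    return True


# Example 5: s = p & X(FGp | Gp), with substates p & XFGp and p & XGp.
U = {"FGp"}
s_fgp = Substate("p & X FGp", frozenset({"FGp"}), frozenset())
s_gp = Substate("p & X Gp", frozenset({"FGp"}), frozenset({"FGp"}))
print(postponement_is_safe([s_fgp, s_gp], U))  # False
```

A state with a single substate always passes the check, which matches the observation that the condition only matters once branching has been postponed.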

4.2 Improvements 

We describe some improvements to the basic schema of MoDeLLA described in the 
previous section. Most of them are adapted from known optimizations. 

Pruning the fair sets. In the previous section, we have noticed that the basic version
of MoDeLLA computes one fair set for every $\mathcal{H} \in 2^{U_\varphi}$, i.e. $2^{|U_\varphi|}$ fair sets. Thus, in order
to reduce this number, in the final computation of the fair conditions $\mathcal{F}$, we apply the
following simplification rules, which are a simple version of an optimization introduced in [12]:

- for all $F \in \mathcal{F}$, if $F = Q$ then $\mathcal{F} := \mathcal{F} \setminus \{F\}$,

- for all $F, F' \in \mathcal{F}$, if $F \subseteq F'$ then $\mathcal{F} := \mathcal{F} \setminus \{F'\}$.
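
The two pruning rules can be sketched in Python as follows. The function name and the representation of fair sets as Python sets are assumptions made for illustration. The second rule is applied to distinct fair sets only: a run that visits $F$ infinitely often also visits any superset of $F$ infinitely often, so the superset is redundant.

```python
def prune_fair_sets(fair_sets, Q):
    """Apply the two simplification rules to a collection of fair sets."""
    all_states = frozenset(Q)
    # Rule 1: a fair set equal to Q constrains nothing. Duplicates collapse
    # because the candidates are stored in a set of frozensets.
    candidates = {frozenset(F) for F in fair_sets} - {all_states}
    # Rule 2: drop F' whenever another fair set F is strictly contained in it.
    return {F2 for F2 in candidates if not any(F1 < F2 for F1 in candidates)}


pruned = prune_fair_sets([{1, 2, 3}, {1}, {1, 2}, {3}], Q={1, 2, 3})
```

In this toy call, `{1, 2, 3}` is dropped by the first rule and `{1, 2}` by the second, leaving the fair sets `{1}` and `{3}`.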

Remark 2. Due to the existential quantifier in the definition (21) of $F_{\mathcal{H}}$, for every formula
$\psi U \vartheta \in \mathcal{H}$, we have that $F_{\{\psi U \vartheta\}} \subseteq F_{\mathcal{H}}$. For this reason, after the above fair sets pruning,
MoDeLLA will keep only those accepting conditions $F_{\mathcal{H}}$ for which $\mathcal{H}$ is a singleton. Thus,
we obtain that $|\mathcal{F}| \le |U_\varphi|$, as in the usual construction.

Merging states. After computing a cover, if two states $s_1 = (\mu_1, \chi)$, $s_2 = (\mu_2, \chi)$ have
the same next part $\chi$ and satisfy the following property:

for all $\psi U \vartheta \in U_\varphi$,

(for all $s_1^* \in Subs(s_1)$, $s_1^* \not\models \psi U \vartheta$) $\Leftrightarrow$ (for all $s_2^* \in Subs(s_2)$, $s_2^* \not\models \psi U \vartheta$) and
(for all $s_1^* \in Subs(s_1)$, $s_1^* \models \vartheta$) $\Leftrightarrow$ (for all $s_2^* \in Subs(s_2)$, $s_2^* \models \vartheta$),

then we substitute them with $s = (\mu_1 \vee \mu_2, \chi)$ where $Subs(s) := Subs(s_1) \cup Subs(s_2)$.
Notice that for every $F \in \mathcal{F}$, we have $s_1 \in F \Leftrightarrow s_2 \in F \Leftrightarrow s \in F$. This technique is
a simpler version of the one introduced in [7], which however applies the merging only
after moving labels from the states to the transitions.
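
A minimal Python sketch of this merging step is given below, assuming a simplified data layout that is not taken from MoDeLLA. Each state is a triple `(label, next_part, subs)`, and each substate is a pair `(sat_until, sat_goal)` of sets of U-formulae, as in the fair-set definition (21). States are grouped by next part and by a signature recording, for each U-formula, the two quantified conditions of (21). Labels are joined symbolically and not simplified, so `p | !p` stands for the label true.

```python
from collections import defaultdict


def merge_states(states, U):
    """Merge states with equal next part and equal fair-set signature."""
    groups = defaultdict(list)
    for label, nxt, subs in states:
        signature = tuple(
            (
                all(u not in sat_until for sat_until, _ in subs),
                all(u in sat_goal for _, sat_goal in subs),
            )
            for u in sorted(U)
        )
        groups[(nxt, signature)].append((label, subs))
    merged = []
    for (nxt, _), members in groups.items():
        labels = [label for label, _ in members]
        # Subs of the merged state is the union of the members' substates.
        subs = [s for _, member_subs in members for s in member_subs]
        merged.append((" | ".join(labels), nxt, subs))
    return merged


# Cover (22) of Example 5, with FGp read as the only U-formula.
cover = [
    ("p", "FGp", [({"FGp"}, set())]),
    ("p", "Gp", [({"FGp"}, {"FGp"})]),
    ("!p", "FGp", [({"FGp"}, set())]),
]
merged_cover = merge_states(cover, {"FGp"})
```

On this input, the two states with next part FGp are merged into one, and the state with next part Gp is left alone.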

Example 6. Consider the formula of Example 5 and the cover produced by the basic 
version of MoDeLLA. After merging the states with the above technique, the cover (22) 
becomes $\{(\top \wedge XFGp), (p \wedge XGp)\}$. Notice that the labels $\top$ and $p$ of the two states are
mutually consistent so that the BA is still non-deterministic. However, we have reduced
the number of states without increasing the non-determinism. $\diamond$

5 Empirical Results

MoDeLLA is an implementation in C of the algorithm described in Section 4. It implements
only phase 2, so that it can be used as the kernel of optimized algorithms that also
include formula rewriting (phase 1) and BA reduction (phase 3). (Indeed, we believe our
technique is orthogonal to the rewriting rules of phase 1 and to BA reductions.)

We extensively tested MoDeLLA in comparison with the state-of-the-art algorithms.
Unlike, e.g., [1,12,5,7], we did not consider as parameters for the comparison the size
of the BA produced, but rather the number of states and transitions of the product
$M \times A_\varphi$ between the BA and a randomly-generated Kripke structure. To accomplish
this, we used lbtt 1.0.1 [13], a randomized testbench which takes as input a set of
translation algorithms for testing their correctness. In particular, lbtt gives the same
formula (either randomly-generated or provided by the user) to the distinct algorithms,
gets their output BAs and builds the product of these automata with a randomly-generated
Kripke structure $M$ of given size $|M|$ and (approximated) average branching
factor $b$. lbtt also provides a random generator producing formulae of given size $|\varphi|$
and maximum number of propositions $P$.
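
To make the measured quantity concrete, the following Python sketch counts the reachable states and transitions of a synchronous product between a Kripke structure and a state-labeled BA. It is a simplified illustration under stated assumptions and does not reproduce lbtt's own product construction: BA states are assumed to be labeled with a pair `(pos, neg)` of propositions required true and false, and all names are hypothetical.

```python
from collections import deque


def product_size(k_init, k_trans, k_label, ba_init, ba_trans, ba_label):
    """Return (number of reachable states, number of transitions) of the product.

    k_label[m]:  set of propositions true in Kripke state m.
    ba_label[q]: (pos, neg) sets of propositions required true / false in q.
    """
    def compatible(m, q):
        pos, neg = ba_label[q]
        return pos <= k_label[m] and not (neg & k_label[m])

    start = {(m, q) for m in k_init for q in ba_init if compatible(m, q)}
    seen, queue, n_trans = set(start), deque(start), 0
    while queue:
        m, q = queue.popleft()
        for m2 in k_trans.get(m, ()):
            for q2 in ba_trans.get(q, ()):
                if compatible(m2, q2):
                    n_trans += 1
                    if (m2, q2) not in seen:
                        seen.add((m2, q2))
                        queue.append((m2, q2))
    return len(seen), n_trans


# Two-state Kripke structure: p holds in m0 only.
k_init, k_trans = {"m0"}, {"m0": ["m1"], "m1": ["m0"]}
k_label = {"m0": {"p"}, "m1": set()}
# One-state automaton requiring p forever, and one accepting everything.
size_gp = product_size(k_init, k_trans, k_label,
                       {"q0"}, {"q0": ["q0"]}, {"q0": ({"p"}, set())})
size_true = product_size(k_init, k_trans, k_label,
                         {"t"}, {"t": ["t"]}, {"t": (set(), set())})
```

A more deterministic automaton tends to pair each Kripke state with fewer automaton states, which is why the product size is used here as the comparison metric instead of the size of the BA.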

To compare MoDeLLA with state-of-the-art algorithms, we provided interfaces between
lbtt and Wring 1.1.0 [12,9] and between lbtt and TMP 2.0 [3,4]. Since lbtt
computes the direct product between the BA and the state space, the size of the product
is not affected by the number of fair sets of the BA. Thus, to get more reliable results,
we have dealt only with degeneralized BA, and we have applied a simple procedure
described in [6] to convert a BA into a Büchi automaton with a single fair set.
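
For readers unfamiliar with degeneralization, the classical counter-based construction is sketched below in Python. It may differ in detail from the procedure of [6]. State labels are omitted and all names are hypothetical. The counter records which fair set is currently awaited and advances when the current state belongs to it.

```python
def degeneralize(states, init, trans, fair_sets):
    """Counter-based degeneralization of a generalized Büchi automaton.

    states: iterable of states; init: set of initial states;
    trans: dict mapping a state to its set of successors;
    fair_sets: non-empty list of sets of states.
    Returns (states, init, trans, accepting) of a BA with a single fair set.
    """
    n = len(fair_sets)
    new_states = {(q, i) for q in states for i in range(n)}
    new_init = {(q, 0) for q in init}
    new_trans = {}
    for q, i in new_states:
        # Move on to the next fair set once the awaited one has been visited.
        j = (i + 1) % n if q in fair_sets[i] else i
        new_trans[(q, i)] = {(q2, j) for q2 in trans.get(q, set())}
    accepting = {(q, 0) for q in fair_sets[0]}
    return new_states, new_init, new_trans, accepting


dg_states, dg_init, dg_trans, dg_accepting = degeneralize(
    {"a", "b"}, {"a"}, {"a": {"b"}, "b": {"a"}}, [{"a"}, {"b"}]
)
```

The result has one copy of every state per fair set, so the number of fair sets does influence the size of the degeneralized automaton and hence of the product.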

We have run lbtt on three dual-processor PCs with 2GB RAM running RedHat Linux.
All the tools and the files used in our experiments can be downloaded at
http://www.science.unitn.it/~stonetta/modella.html.



5.1 Comparing Pure Translators 

In a first session of tests, we wanted to verify the effectiveness of MoDeLLA as a pure
“phase 2” translator. Thus, we compared MoDeLLA with “pure” translators (no formula
rewriting, no BA reduction), i.e. with GPVW [6], LTL2AUT [1]¹ and Wring [12] with
rewriting rules and simulation-based reduction disabled (Wring(2) henceforth). Notice
that TMP uses LTL2AUT as phase 2 algorithm [3]. For reasons which will be described
in the next section, we also ran a version of MoDeLLA without the merging of states
(MS) optimization of Section 4.2 (which we call MoDeLLA-MS henceforth).

We fixed $|M|$ to 5000 states and we made $b$ grow exponentially in {2, 4, 8, 16, 32, 64}.
We did four series of tests: 1) tests with 200 random formulae with $|\varphi| = 15$ and $P = 4$;
2) tests with 200 random formulae with $|\varphi| = 15$ and $P = 8$; 3) tests on the 27 formulae
proposed in [12]; 4) tests on the 12 formulae proposed in [3]. For every formula $\varphi$, we
tested both $M \models \varphi$ and $M \models \neg\varphi$. The results are reported in Figure 3. (In the fourth
series, the runs of GPVW and LTL2AUT were stopped for $b > 16$ because they caused
a memory blowup.)

¹ For GPVW and LTL2AUT, we have used the reimplementation provided by Wring.

Fig. 3. Performances of the pure “phase 2” algorithms. X axis: approximate average branching
factor of $M$. Y axis: mean number of states (left column) and of transitions (right column) of the
product $M \times A_\varphi$. 1st row: 400 random formulae, 4 propositions; 2nd row: 400 random formulae,
8 propositions; 3rd row: 24 formulae from [12]; 4th row: 54 formulae from [3].

Fig. 4. Same experiments as in Figure 3, adding phases 1 and 3 to the pure “phase 2” algorithms.

Comparing the plots in the first column (number of states of $M \times A_\varphi$) we notice that
(i) GPVW and LTL2AUT perform significantly worse than the other algorithms;
(ii) MoDeLLA performs better than Wring(2) in all the test series; (iii) even with the MS
optimization disabled, MoDeLLA mostly performs better than Wring(2).

Comparing the plots in the second column (number of transitions of $M \times A_\varphi$) we
notice that Wring(2) performs much better than LTL2AUT and GPVW, and that both
MoDeLLA and MoDeLLA-MS always perform better than Wring(2). In particular,
the performance gaps are very significant in the fourth test series.

5.2 Comparing Translators with Rewriting Rules and Simulation-Based 
Reduction 

In a second session of tests, we investigated the behaviour of MoDeLLA as the kernel of a
more general algorithm, embedding also the rewriting rules (phase 1) and the simulation-based
reduction (phase 3) of Wring and TMP. This allows us to investigate the
effective “orthogonality” of our new algorithm with respect to the introduction of rewriting rules
and of simulation-based reduction.

First, we applied to our algorithm the rewriting rules described in [12] and interfaced
MoDeLLA-MS with the simulation-based reduction engine of Wring. Unfortunately,
since Wring accepts only states labeled with conjunctions of literals, we could interface
Wring only with MoDeLLA-MS and not with the full version of MoDeLLA. (We
denote the former as MoDeLLA-MS+Wring(13) henceforth.) Second, we applied to
MoDeLLA the rewriting rules described in [3] and the simulation-based reduction described
in [4], which are respectively phase 1 and phase 3 of TMP. (We call this
enhanced version of our algorithm MoDeLLA+TMP(13) henceforth.) Finally, we implemented
the optimization technique described in [7]. When we enable this technique,
together with the rewriting rules and TMP’s automata reduction, we refer to it as
MoDeLLA+ALL.

We ran the tests with the same parameters as in the first session of tests, obtaining the
results of Figure 4. By looking at the plots, one can observe the following facts for both
the columns (number of states and number of transitions of $M \times A_\varphi$): (i) if compared
with the corresponding phase 2 algorithms, MoDeLLA-MS+Wring(13) and MoDeLLA+TMP(13)
benefit a lot respectively from Wring’s and TMP’s rewriting rules and simulation-based
reduction, although slightly less than Wring and TMP themselves do; (ii) MoDeLLA-MS+Wring(13)
and MoDeLLA+TMP(13) mostly perform better than Wring(123) and TMP respectively,
although the gap we had with “pure” algorithms is reduced;
(iii) MoDeLLA+ALL performs better than all the others, except in the third
test series where MoDeLLA-MS+Wring(13) is the best performer.



6 Conclusions and Future Work 

In this paper we have presented a new approach to build BA from LTL formulae, which 
is based on the idea of reducing as much as possible the presence of nondeterministic 
decision states in the automata; we have motivated this choice and presented a new 
conversion algorithm, MoDeLLA, which implements these ideas; we have presented an
extensive empirical test, which suggests that MoDeLLA is a valuable alternative as a
core engine for state-of-the-art algorithms.

We plan to extend our work in various directions. From the implementation viewpoint,
we want to implement in MoDeLLA the simulation-based reduction techniques
presented in [12] in order to have a tool which exploits the power of all state-of-the-art
automata reductions. From an algorithmic viewpoint, we want to investigate new
optimization steps tailored to our approach. From a theoretical viewpoint, we want to
investigate more general sufficient conditions for branching postponement.

Another interesting research direction, though much less straightforward, might be 
to investigate the feasibility and effectiveness of introducing semantic branching in the 
alternating-automata based approach of [5]. 

Finally, we would like to test the performance (with respect to time and memory consumption)
of state-of-the-art LTL model checkers, e.g. SPIN [10], on real-world benchmarks by
using the automata built by MoDeLLA.

Abstract. It has been shown that bounded model checking using a SAT solver
can solve many verification problems that would cause BDD-based symbolic
model checking engines to explode. However, no single algorithmic solution has
proven to be totally superior in resolving all types of model checking problems.
We present an optimized bounded model checker based on BDDs and describe 
the advantages and drawbacks of this model checker as compared to BDD-based 
symbolic model checking and SAT-based model checking. We show that, in some 
cases, this engine solves verification problems that could not be solved by other 
methods. 



1 Introduction 

As the use of formal verification in industrial settings continues to grow [3,5], contemporary
research seeks diverse ways to solve the “state explosion” problem inherent in
model checking. In recent years, the traditional methods of BDD-based symbolic model
checking [10] have been augmented by methods which are based on Boolean Satisfiability
(SAT) [13,11] that can solve the Bounded Model Checking (BMC) [7] problem.
Unlike the model checking problem that, given a model $M$ and a property $\varphi$, tries to
determine if $M \models \varphi$, the BMC problem restricts itself to determining whether $M \models \varphi$
on the first $k$ iterations of $M$. The class of properties that can be checked this way is
smaller than the one handled by model checking, as described in Section 2.

The BMC problem is usually solved by reducing the model and the bug detection 
circuit, unfolded k cycles, to a propositional formula, and then solving this formula using 
a SAT solver. However, other approaches are also applicable. Bertacco and Olukotun [6] 
suggest a BDD-based algorithm that unfolds the sequential circuit k times in order to 
calculate the values of signals on the first k cycles. This algorithm is based on symbolic 
simulation methods [8], and has some advantages over the SAT approach described in 
[7]. The main advantage is that the unfolded structure uses BDD variables only for inputs 
to the model. Therefore, when the number of inputs is small compared to the number of 
state variables, as in the case of datapath, this approach is advantageous. In this paper, 
we describe an optimized BDD-based BMC engine, based on this unfolded structure. 

2 Basic Concepts 

We consider bounded model checking to be the following problem: given a nondeterministic
Finite State Machine (FSM) $M$, $n$ RCTL [4] properties $(\varphi_1, \ldots, \varphi_n)$ and a
bound $k$, we want to check if each of $(\varphi_1, \ldots, \varphi_n)$ holds in the first $k$ cycles of $M$. The
FSM consists of parts originating from the following sources: a synchronous hardware
design to be verified and a nondeterministic environment that defines restrictions on the
inputs to the design. In addition, for each property $\varphi_m \in (\varphi_1, \ldots, \varphi_n)$, $\varphi_m$ is translated
to an automaton and a formula of the form $AG(p_m)$, where $p_m$ is a Boolean expression,
as described in [4], and both the automaton and $p_m$ are included in the FSM (each $p_m$
is an output of a gate). Nondeterministic behavior is translated to free inputs.

Fig. 1. An FSM

Fig. 2. An unfolded FSM

An FSM can be defined by the following 6-tuple $(CC_0, I_0, CC, I, S, P)$:

• $CC_0$ is combinatorial logic that generates the initial states of the flip-flops.

• $I_0 = (i_{(1,0)}, \ldots, i_{(t,0)})$ is an ordered set of Boolean inputs to $CC_0$.

• $CC$ is combinatorial logic that generates the next state function of the flip-flops.

• $I = (i_1, \ldots, i_q)$ is an ordered set of Boolean inputs to $CC$.

• $S = (s_1, \ldots, s_r)$ is a set of symbols representing the outputs of the flip-flops.

• $P = (p_1, \ldots, p_n)$ is an ordered set of Boolean outputs representing the properties
$(\varphi_1, \ldots, \varphi_n)$.

$(CC_0, I_0, CC, I, S, P)$ is illustrated in Figure 1.

3 BDD-Based BMC 

This section describes how an FSM is transformed into a combinatorial circuit that 
represents the first k cycles of the FSM, as well as the computation process applied to 
the combinatorial circuit in order to evaluate the properties in the first k cycles. 

3.1 Circuit Unfolding 

The unfolding process transforms an FSM, which is a sequential circuit, into an iterative 
logic array, as depicted in Figure 2. The combinatorial logic, inputs, and properties of 
the FSM are duplicated k times, and the flip-flops are replaced by wires connecting the 
copies of the different iterations. Therefore, the S parts do not actually exist; they are 
depicted only to indicate where the flip-flops existed previously. Assuming there are no
combinatorial loops in $CC_0$ and $CC$ of the original FSM, there are no combinatorial
loops in the combinatorial circuit resulting from the unfolding process.

Definition 1 (Closed machine). The circuit that results from the unfolding process is
called a closed machine.

We use the netlist representation of the unfolded FSM as our basic data structure. This 
data structure is referred to as the circuit. 
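
The unfolding idea can be illustrated with a small Python sketch. It replaces BDDs by exhaustive enumeration of input values, so it only scales to toy examples, and it is not the netlist data structure described in this paper. The callables `cc0` and `cc`, standing for $CC_0$ and $CC$, and the counter example are assumptions made for illustration.

```python
from itertools import product


def unfold(cc0, cc, k):
    """Return the closed machine as a function from (i0, [i1..ik]) to the
    property values of cycles 1..k. Flip-flops become plain data flow
    between successive copies of the combinational logic."""
    def closed_machine(i0, inputs):
        assert len(inputs) == k
        state = cc0(i0)
        props = []
        for ins in inputs:  # copy number j of the combinational logic
            state, p = cc(state, ins)
            props.append(p)
        return props
    return closed_machine


def check_bounded(cc0, cc, k, n_i0, n_i):
    """Evaluate the unfolded circuit on every input assignment. Return the
    failing cycle and a counter example, or None if the property holds."""
    machine = unfold(cc0, cc, k)
    for i0 in product([False, True], repeat=n_i0):
        for flat in product([False, True], repeat=n_i * k):
            inputs = [flat[j * n_i:(j + 1) * n_i] for j in range(k)]
            props = machine(i0, inputs)
            if not all(props):
                return props.index(False) + 1, (i0, inputs)
    return None


# Toy design: a 2-bit counter with an enable input; property "value != 3".
def counter_cc0(i0):
    return 0


def counter_cc(state, ins):
    nxt = (state + int(ins[0])) % 4
    return nxt, nxt != 3


holds_for_2 = check_bounded(counter_cc0, counter_cc, 2, n_i0=0, n_i=1)
fails_at = check_bounded(counter_cc0, counter_cc, 3, n_i0=0, n_i=1)
```

With a bound of 2 the property holds, while a bound of 3 exposes the violation reached by enabling the counter in every cycle.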

3.2 Verification Using the BDD-Based BMC 

We use the following terms in the description of the computation process: 

• Cycle is the pair $(S_{j-1}, (CC_j \cup I_j \cup P_j))$ (corresponds to cycles in calculations of
FSM). This cycle is denoted as cycle number $j$.

• $p_{m,j}$ is the gate that represents property $p_m$ in cycle $j$.

• $g_j$ represents the replication of a certain gate $g$ of the FSM in cycle $j$.

• The cone of a gate $g_j$ is the set of all gates on which $g_j$ topologically depends.

• A fanin of a gate $g_j$ is a gate $f_{j'}$ whose output is a direct input to $g_j$.

• A fanout of a gate $g_j$ is a gate $h_{j''}$ that has a direct input, which is the output of $g_j$.

Definition 2 (Gate function). The function of a gate $g_j$ (denoted $f[g_j]$) is the parametric
representation of the gate $g_j$ depending on $(I_0, \ldots, I_k)$. $f[g_j]$ operates on all of the FSM
inputs $(I_0 \times \ldots \times I_k)$ and goes to $\{0, 1\}$, $f : B^{t + q \cdot k} \to B$.



Definition 3 (Frontier). The frontier $F$ is a set of gates where for each gate $g \in F$,
two conditions hold: all of the fanins of $g$ have a calculated BDD and the BDD of $g$ is
not yet calculated.

The initial frontier is built by going backwards from the properties, until we reach 
primary inputs or gates for which there is a calculated BDD. (These gates were in the 
cone of influence of properties in previous cycles.) The fanouts of these inputs and gates 
compose the initial frontier. The frontier may change whenever we calculate a BDD of 
a gate. 

For each gate $p_{m,j}$ of $p_{1,1}, \ldots, p_{n,k}$ we build the BDD that represents the function
of the gate $p_{m,j}$. If the BDD of $p_{m,j}$ equals the function true, then $p_m$ holds in cycle $j$.
Otherwise, we extract out of the BDD a non-satisfying assignment as a counter example.
In order to calculate the BDD of $p_{m,j}$, we must first calculate the BDDs in the cone of
$p_{m,j}$. When building the BDD of $g_j$, we use the BDDs of all of the fanins of $g_j$. Therefore,
the structure of the closed machine dictates a partial order of calculation on the gates.
Note that different copies of the same gate $g$ in different cycles may have different BDDs.
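
The frontier-driven evaluation order can be sketched in Python as follows. As a stand-in for BDDs, each gate function is stored as the set of input assignments that satisfy it, which is exponential in the number of inputs and only meant for illustration. The netlist encoding, with three gate types and fanin lists, is an assumption and not the authors' data structure.

```python
from itertools import product


def evaluate_netlist(inputs, gates, targets):
    """Evaluate the target gates by repeatedly processing the frontier.

    inputs:  list of primary input names.
    gates:   dict name -> (op, [fanin names]), op in {"and", "or", "not"}.
    targets: property gates to evaluate.
    Returns (dict target -> set of satisfying assignments, set of all assignments).
    """
    assignments = list(product([False, True], repeat=len(inputs)))
    universe = frozenset(assignments)
    table = {x: frozenset(a for a in assignments if a[idx])
             for idx, x in enumerate(inputs)}

    # Only gates in the cone of the targets are ever evaluated.
    cone, stack = set(), list(targets)
    while stack:
        g = stack.pop()
        if g in gates and g not in cone:
            cone.add(g)
            stack.extend(gates[g][1])

    def frontier():
        # Gates not yet calculated whose fanins are all calculated.
        return [g for g in cone if g not in table
                and all(f in table for f in gates[g][1])]

    front = frontier()
    while front:
        for g in front:
            op, fanins = gates[g]
            args = [table[f] for f in fanins]
            if op == "and":
                table[g] = frozenset.intersection(*args)
            elif op == "or":
                table[g] = frozenset.union(*args)
            else:  # "not"
                table[g] = universe - args[0]
        front = frontier()
    return {t: table[t] for t in targets}, universe


netlist = {
    "n": ("not", ["a"]),
    "p": ("or", ["a", "n"]),   # tautology: the property holds
    "q": ("and", ["a", "b"]),  # fails whenever a or b is false
    "r": ("not", ["b"]),       # outside the cone of p and q, never evaluated
}
functions, all_assignments = evaluate_netlist(["a", "b"], netlist, ["p", "q"])
counter_examples = all_assignments - functions["q"]
```

A property holds when its function equals the set of all assignments. Otherwise any assignment outside the function is a counter example.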

3.3 Advantages and Drawbacks of BDD-Based BMC 

The BDD-based BMC approach uses a parametric representation of the state of the flip- 
flops, depending only on the inputs of the model. That is, the set of reachable states 
in cycle j is represented by a collection of BDDs representing f[gj], for all gates gj 
that represent outputs of the flip-flops in cycle j. As a result, the BDD-based BMC
is only sensitive to the amount of nondeterminism in the model. In contrast, symbolic
model checking and SAT solvers represent the states by state variables. Therefore, they
are sensitive both to the amount of nondeterminism and to the number of state variables.
In addition, the functions computed by the BDD-based BMC describe the natural
functionality of the original model. Symbolic model checking computes a characteristic 
representation of the reachable states, which is randomly shaped, and its BDD tends to be 
bigger than those of the natural functions. Another advantage versus SAT is that multiple 
properties are computed in the same run, without repeating calculations of overlapping 
cones of influence of these properties. SAT solvers need to backtrack after a counter 
example is found and thus repeat parts of the calculations. The main drawback of our 
approach is its sensitivity to the number of calculated cycles. In each cycle, q variables 
are added and therefore the complexity of calculation increases as the cycles advance. 
As a result of these advantages and drawbacks, the BDD-based BMC approach performs 
better than the other methods in wide and shallow circuits (i.e., circuits that have many 
state variables, but their state space can be covered by a few cycles) and in circuits with 
many state variables, but with a low amount of nondeterminism. 

Due to the static unfolding, the circuit is amenable to static BDD variable ordering, 
based on its topology. In many cases, this order is sufficient for calculation without a 
need for dynamic BDD reordering. We can also simplify the evaluation of the properties 
by performing easy calculations before the difficult ones. Our measure of difficulty is 
the expected BDD size of the gate, which we estimate according to the sizes of the input 
BDDs. We first traverse the easier calculation paths, and in many cases, as a result of
constant propagation during the computation process, some more difficult calculations 
that were not yet performed become redundant. 

4 Open Machine 

We will now introduce a variation of the unfolding algorithm, which enables powerful 
optimizations to the BDD-based BMC engine, as will be described later. Additionally,
this variation enables us to prove properties in some cases, despite the fact that we are 
calculating only a bounded number of cycles. 

Definition 4 (Open machine). An open machine is a closed machine whose logic $CC_0$
is replaced by free inputs, as depicted in Figure 3.

These free inputs are denoted by $I_0^f$. Note that the number of inputs in $I_0^f$ may be
different from the number of inputs in $I_0$.

4.1 The Difference between the Open Machine and the Closed Machine 

Let $f^{op}[g_j]$ denote a gate function in the open machine, and $f^{cl}[g_j]$ denote a gate function
in the closed machine.

Definition 5 (Equivalence between gate functions). Two gate functions $f[g_x]$ and
$f[g'_y]$ are equal if and only if the BDD of $g_x$ equals the BDD of $g'_y$. This equivalence is
denoted by $f[g_x] = f[g'_y]$.

Note that $f^{op}[g_j]$ is not necessarily equal to $f^{cl}[g_j]$.

Fig. 3. Open machine

Theorem 6. If $f^{op}[g_x] = f^{op}[g'_y]$, then $\forall j \ge 0$: $f^{op}[g_{x+j}] = f^{op}[g'_{y+j}]$ and $f^{cl}[g_{x+j}] = f^{cl}[g'_{y+j}]$.

For proof see [14]. Note that a closed machine version of Theorem 6 does not hold,
i.e., if $f^{cl}[g_x] = f^{cl}[g'_y]$ we cannot conclude anything about other gates in the closed
machine or in the open machine.

Corollary 7. It stems from Theorem 6 that if $f^{op}[g_x] = b$, $b \in \{0, 1\}$, then
$\forall j \ge 0$: $f^{op}[g_{x+j}] = b$ and $f^{cl}[g_{x+j}] = b$.



4.2 Uses of the Open Machine 
Proving Properties 

In some cases, Theorem 6 gives us the ability to prove properties, despite the fact that
we are calculating a bounded number of cycles. We prove $\varphi_m$ by calculating the BDD
of $p_{m,j}$ for all $j = 1, \ldots, k$ in the open machine. Calculation is performed in the same
manner described for the closed machine. If we find that the BDD of $p_{m,j}$ equals true
for some $1 \le j \le k$, we can conclude that $\varphi_m$ holds both in the open machine and in
the closed machine for all cycles $\ge j$. As described in [9], we can prove a property in
a bounded circuit in this way only if the circuit is k-definite with respect to the property
(i.e., the property in each cycle depends only on inputs of at most the last $k$ cycles).
While the method in [9] is performed only in order to try and prove properties, we use a
more general characteristic of the open machine (introduced in Theorem 6) mainly for
optimizations, as described in the next subsection.

An induction-based algorithm, based on a SAT solver, is suggested in [12] for proving
safety properties. We chose a different approach in order to accommodate large, real- 
world, circuits. Our method is suitable only for a subset of the circuits for which the 
method in [12] is suitable. However, our method can be efficiently implemented using 
the BDD-based BMC. 

Optimizations Based on the Open Machine 

Before applying the computation process to the closed machine, we perform two powerful
optimizations that simplify further calculations, based on the open machine:

1. Constant propagation. There are constant signals in the FSM that originate in
restrictions of the environment on the design’s inputs. When we find that $g_i$ is the
constant $b$ in the open machine, we automatically propagate $b$ to all $g_j$, for $j > i$,
both in the open machine and in the closed machine, according to Corollary 7. Due
to the special data structure, described later, the time complexity of the propagation
is independent of $k$.

2. Logical equivalence. If a gate $g$ is k-definite, the copy of $g$ in cycle $j$ has the same
BDD as the copy of $g$ in cycle $j + k$, for all $j \ge 1$. Another case in which different
gates have equal BDDs occurs as a result of logic duplication in the original model.
We find in the open machine sets of gates with equal BDDs and gather them in
equivalence sets. Each equivalence set actually represents up to an infinite number
of equivalence sets, since the next cycle replications of the gates in each equivalence
set are also an equivalence set. When the computation process runs in the closed
machine, we only calculate one BDD for each equivalence set.
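
Corollary 7 can be checked on a toy design by brute force, as in the Python sketch below. The sketch enumerates states and inputs explicitly and does not use BDDs or the folded data structure of this paper. The design, with two flip-flops that latch the same input and a gate `s1 AND NOT s2`, and the names `toy_step` and `toy_gate` are assumptions made for illustration. The open machine starts from every possible state. The closed machine starts from one state, which stands in for the output of $CC_0$.

```python
from itertools import product


def gate_values(step, gate, init_states, n_inputs, cycle):
    """Set of values taken by `gate` after `cycle` steps, over all listed
    initial states and all input sequences."""
    values = set()
    for s0 in init_states:
        for seq in product([False, True], repeat=n_inputs * cycle):
            state = s0
            for j in range(cycle):
                state = step(state, seq[j * n_inputs:(j + 1) * n_inputs])
            values.add(gate(state))
    return values


def toy_step(state, ins):
    # Both flip-flops latch the same input.
    return (ins[0], ins[0])


def toy_gate(state):
    return state[0] and not state[1]


open_init = list(product([False, True], repeat=2))  # free initial state
closed_init = [(True, False)]                       # assumed output of CC_0

open_cycle0 = gate_values(toy_step, toy_gate, open_init, 1, 0)
open_cycle1 = gate_values(toy_step, toy_gate, open_init, 1, 1)
closed_later = [gate_values(toy_step, toy_gate, closed_init, 1, c) for c in (1, 2, 3)]
closed_cycle0 = gate_values(toy_step, toy_gate, closed_init, 1, 0)
```

The gate is not constant in the open machine before the first step, but it is the constant false after one step. It is therefore false in every later cycle of the closed machine as well, even though the closed machine starts with the gate at true.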

Data Structure for the BDD-Based BMC 

Our data structure represents both the closed machine and the open machine. While our 
implementation of the data structure conceptually allows us to perform operations on 
each of the 2 × k replications of each gate g at any time, initially there is only one
object (whose size is independent of k) in the data structure for every gate g of the 
original FSM. This representation may change as various operations are performed on 
the circuit. As a result, the common size of the objects representing the replications of 
g may grow and, in the worst case, depend on k. In practice, most of the data structure 
remains folded during the entire run. When an operation is performed on a gate gj in 
the open machine, it also applies to all of the relevant gates of the subsequent cycles, 
according to Theorem 6. In most cases, the time complexity is independent of k, since 
all of the relevant gates are a single object in the data structure. 

5 Under-Approximation 

Despite the simplification methods and despite applying reordering algorithms, the
BDDs can still grow as the cycles advance and may eventually outgrow the memory
resources. One solution is to perform under-approximations, although this compromises
on coverage. Each under-approximation is performed by choosing an input
$i_l \in I_j$, $0 \le j \le k$ (denoted $i_{l,j}$) and setting it to a constant value $b \in \{0, 1\}$ for the rest
of the run. Next, we simplify the already calculated BDDs accordingly. The heuristics
we use to choose $i_{l,j}$ and $b$ try to find the best variable assignment that will balance
between causing a significant reduction in the BDD sizes and leaving many behaviors
in the scope of the calculation. The heuristics also take into account that if $i_{l,j}$ was set
to $b$ and we are performing a new under-approximation, then we prefer not to choose
any of the inputs $i_{l,j'}$ for $j' \ne j$, or if we choose one of them, then set it to $\neg b$. In this
way, we degenerate the behavior of an input only in a specific cycle, rather than for the
entire run. Examples of heuristics for choosing $i_{l,j}$ and $b$ appear in [14]. Running the
computation process with under-approximations is especially useful for finding bugs
that, on one hand, occur after many cycles, and therefore an exhaustive search would
be difficult, and on the other hand are quite common (occur for many possible sets of
inputs) and therefore can be found even when the search is partial.

Fig. 4. Optimized BDD-based BMC versus SAT
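
The effect of an under-approximation can be illustrated with the following Python sketch. Fixing an input to a constant shrinks the explored input space, every bug found remains a genuine bug, and some bugs may be missed. The sketch enumerates assignments explicitly and does not use BDD restriction. The property and the input names `a0`, `a1`, `a2`, one input in three cycles, are hypothetical.

```python
from itertools import product


def failing_assignments(prop, names, fixed=None):
    """Enumerate the input assignments violating `prop`, with the inputs in
    `fixed` (name -> constant) held at their chosen values."""
    fixed = fixed or {}
    free = [n for n in names if n not in fixed]
    result = []
    for values in product([False, True], repeat=len(free)):
        env = dict(zip(free, values), **fixed)
        if not prop(env):
            result.append(env)
    return result


names = ["a0", "a1", "a2"]


def prop(env):
    # Hypothetical property: violated when the input is high in cycles 1 and 2.
    return not (env["a1"] and env["a2"])


exact = failing_assignments(prop, names)
under = failing_assignments(prop, names, {"a0": False})
missed = failing_assignments(prop, names, {"a1": False})
```

Fixing `a0` halves the search space and still exposes the bug, whereas fixing `a1` to false hides it. This is the coverage trade-off that the choice heuristics try to balance.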

We also implemented a mode that combines under-approximations with backtrack- 
ing, to perform exact evaluation of the properties. In this mode, whenever reaching the 
cycle bound, we backtrack and compute parts of the search space which were neglected 
as a result of previous under-approximations. 

6 Experimental Results 

We implemented the optimized BDD-based BMC in the framework of IBM’s model 
checker RuleBase [2], and used the CUDD package [1] for BDD calculations. The table 
in Figure 4 presents the results of our engine versus an IBM zChaff-based SAT solver. 
The engines ran on real-life examples taken from various projects. Both engines operated 
using default configurations. We set a timeout of 36000 seconds, a memory limit of 1 GB,
and a bound of 100 cycles.

The number of inputs, flip-flops, and properties is shown for each circuit. The total
run-time is in seconds and the memory is in MB. The cycles column is the number of
cycles the engine calculated until reaching either the cycle bound, timeout, or memory
limit, or until all properties failed. The res column displays whether the engine
managed to disprove the properties. The # app column displays the number of
under-approximations performed during the computation process. We also ran several symbolic
model checkers on these examples; all of them outgrew memory resources on design1
to design5, while computing the set of initial states. When under-approximations were
used, we report, in parentheses, the time and memory consumption of the run without
under-approximations. These results demonstrate the significant decrease in time and
memory demands our under-approximations achieve.

(*) The SAT solver reached timeout after 70 cycles in each of the 15 runs. 

(**) The SAT solver reached timeout while constructing the CNF formula. Following a SAT
expert's advice, we ran the SAT solver without the bounded cone of influence reduction.
With this configuration, it found a counter example for the first property after 189 seconds
and for the second property after 139 seconds, about 10 times slower than the
unfolding engine (combining the run-time of the two properties).

The table in Figure 5 reports the run-time in seconds of the constant propagation
performed on the FSM unfolded 100 cycles. The open and closed machine column
presents the run-time of constant propagation, as it is performed in our optimized engine:
first on the open machine (according to Corollary 7) and then on the closed machine.
Note that constant propagation on the open machine changes both the topology of the
open machine and of the closed machine. The only closed machine column presents
constant propagation as it would have been performed in a standard implementation
(i.e., only on the closed machine). We conclude that there is a significant decrease in
run-time when performing constant propagation on the open machine. Note that in many
cases, constant propagation on the closed machine alone dominates the running time and
may even cause timeout.

Fig. 6. Construction time

The table in Figure 6 reports the run-time for each circuit in seconds of unfolding
the FSM k cycles, out of the netlist representation of the original FSM, for k = 100 and
for k = 300. This table demonstrates the fact that, due to our data structure, the circuit
unfolding time does not have a linear dependency on the cycle-bound k.

Abstract. We use symbolic simulation for the verification of high level
circuit specifications. We combine Mathematica for algebraic computation
and ACL2 for branching decision to increase the efficiency of the
method.



1 Introduction 

Symbolic simulation, proposed as early as 1979 by J. Darringer, is intermediate between
conventional simulation and mathematical reasoning, to verify abstract,
pre-RTL design specifications. Instead of simulating a design with numerical
values, symbolic inputs are given to the symbolic simulator, which produces an
algebraic expression for the memory and output variables, as a function of the
initial state and of the inputs. These difficulties arise: (1) the symbolic expressions
may become exponentially large in the number of simulation cycles; (2) in
the presence of conditional statements, when the condition is a symbolic term,
all alternative paths must be explored. The simulator generates a simulation
tree, which may also grow exponentially; (3) the automatic simplification and
reduction of the computed symbolic expressions is needed, else the outputs of
symbolic simulation are unreadable.

Previous works have tackled one or more of the above difficulties: e.g. GSTE 
[9] at switch and gate-level, PVS [7] and ACL2 [4] at the initial abstract design
levels. To simplify symbolic simulation by reducing algebraic expressions and 
controlling the expansion of the simulation tree, most proposed solutions use an 
automated reasoning tool. 

A systematic approach for using ACL2 as a symbolic simulation engine was
proposed by J. Moore [6]. On this basis, the semantics of a subset of VHDL [3]
were defined in ACL2 in order to simulate a VHDL design symbolically [2]. In
this paper we propose a different approach based on the separation of algebraic
computation and branching decision. We combine Mathematica [8], a computer
algebra system, and ACL2 [4], an automatic theorem prover, to perform what we
call constrained symbolic simulation. This association increases the efficiency
of the symbolic simulation by using two tools, each one being powerful in its
domain.

Fig. 1. Overview of the method



2 Overview of the Method 

Figure 1 shows the overall combined verification system taking VHDL inputs.
The front-end compiler performs syntactic and static semantic checks, and
serves as a common starting point for all EDA tools. NIF is an intermediate format
developed by our group. The elaboration of the Mathematica model, called M-code,
is performed on the NIF file. During this step, data type restrictions are
extracted as constraints. Before starting the simulation, the user, who is not
necessarily a proof expert, can add constraints on the inputs. Those are inequalities
or equalities between expressions composed of design variables or input signals
and arithmetic operators (+, -, /, ×). M-code and constraints are submitted
to Mathematica for n simulation cycles, where n is user defined. During simulation,
symbolic expressions are simplified using rewrite rules. Standard Mathematica
simplification rules are algebraic axioms like (x - x → 0) and arithmetic
simplifications like (n + n → 2n), for terms defined on real or integer types. VHDL
simplification rules were defined by us for the hardware types unknown to
Mathematica (e.g. Bit). To reduce the simulation tree, whenever path conditions are
encountered, ACL2 is called as a reasoning engine. ACL2 evaluates a given
condition under simulation constraints using pre-proved theorems. Depending on
the ACL2 answer, Mathematica chooses a path. After each simulation cycle, the
values of all variables and signals are stored in a file. This is the result of the
constrained symbolic simulation of the VHDL description.
Table 1. Example of stabilizing concurrent assignments

cycle   VHDL expressions
-----   ----------------
1       a <= (d and not(c)) or (b and c);
        b <= (a and not(c)) or (d and c);
2       a <= (d and not(c)) or (((a and not(c)) or (d and c)) and c);
        b <= (((d and not(c)) or (b and c)) and not(c)) or (d and c);
3       a <= d;
        b <= d;



3 Modeling VHDL in Mathematica 

The VHDL supported by our tool is based on the standard subset for Register 
Transfer Level (RTL) synthesis [3], enlarged with full arithmetic types. Com- 
binational logic and clock-edge synchronized sequential logic may be described 
using a behavioral, structural or dataflow style, or any combination thereof. A 
model is a component, i.e. an entity coupled with its associated architecture. 

Due to the absence of explicit time [3], the simulation algorithm is simplified, 
as described in [2]: the driver of a signal only holds one current and one next 
value, since right hand side waveforms are a single zero delay expression (the 
after clause is not recognized in the subset). Concurrent signal assignments and
combinational processes are stabilized by performing delta computation cycles
between any two consecutive clock simulation cycles. In this context, the model is
observable only at the clock cycle level.

In the M-code, a VHDL component built from an (entity Ent, architecture A)
pair is modeled by a Mathematica function named EntA. Its arguments are all
the objects declared in the corresponding entity-architecture pair: input, output and
local signals, and local variables. All are named Mathematica blank patterns,
i.e. no data type is defined. However, the information about data types is not
lost: it serves as simulation constraints.

Two Mathematica variables are necessary to model each local or output
signal: one for the current value, passed to EntA as an argument; one for the next
value, declared as a temporary variable inside the body of EntA. Input signals,
which cannot be modified in the architecture, only have a current value.

The body of EntA is the Mathematica model for the VHDL statements 
inside the architecture. All processes are flattened inside the body. To eliminate 
simulation delta cycles, we perform a symbolic fixed point computation during 
M-code generation. We repeat the execution, symbolically and sequentially, of all 
concurrent signal assignments, and simplify the expressions, until they stabilize. 
The next values of all signals can then be computed in one step. 

Table 1 displays the three cycles needed to stabilize the symbolic value for the
concurrent assignments shown at cycle 1. In the M-code, symbolic delta cycles
are no longer needed. The corresponding M-code for this example is:

NextSig[a, d];
NextSig[b, d];
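
The fixed point computation of Table 1 can be replayed with a small sketch. It is not the authors' M-code generator: it is a Python illustration under the assumption that the right-hand sides are substituted simultaneously at each delta cycle, and it uses truth tables as a canonical form in place of Mathematica's simplifier. The names RHS, truth_table, substitute and stabilize are hypothetical.

```python
from itertools import product

SIGNALS = ("a", "b", "c", "d")

# Right-hand sides of the two concurrent assignments of Table 1 (cycle 1).
RHS = {
    "a": lambda e: (e["d"] and not e["c"]) or (e["b"] and e["c"]),
    "b": lambda e: (e["a"] and not e["c"]) or (e["d"] and e["c"]),
}

def truth_table(f):
    """Canonical form of a Boolean function: its value on all 16 inputs."""
    return tuple(bool(f(dict(zip(SIGNALS, bits))))
                 for bits in product((False, True), repeat=len(SIGNALS)))

def substitute(rhs, current):
    """Return rhs with every assigned signal replaced by its current expression."""
    def composed(e):
        env = dict(e)
        env.update({sig: expr(e) for sig, expr in current.items()})
        return rhs(env)
    return composed

def stabilize(rhs_map, max_rounds=10):
    """Repeat symbolic delta cycles until no expression changes any more."""
    current = dict(rhs_map)
    for rounds in range(1, max_rounds + 1):
        nxt = {sig: substitute(rhs, current) for sig, rhs in rhs_map.items()}
        if all(truth_table(nxt[s]) == truth_table(current[s]) for s in rhs_map):
            return current, rounds
        current = nxt
    raise RuntimeError("assignments did not stabilize")

stable, rounds = stabilize(RHS)
# Both stabilized expressions are equivalent to the single signal d.
print(truth_table(stable["a"]) == truth_table(lambda e: e["d"]))
print(truth_table(stable["b"]) == truth_table(lambda e: e["d"]))
```

Comparing truth tables only works for a handful of Boolean signals; it stands in here for the algebraic simplification that the text delegates to Mathematica.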
Table 2. Examples of M-code assignment functions

VHDL            M-code
----            ------
A <= d + g;     NextSig[Anext, Plus[d, g]];
V := 2 + j;     ChangeVar[V, Plus[2, j]];
Q := V + 1;     ChangeVar[Q, Plus[V, 1]];



Table 3. Syntax of VHDL branching statements in M-code

VHDL                            M-code
----                            ------
If B then state-bloc-1          If[B, state-bloc-1
else state-bloc-2                 , state-bloc-2
end if                            , decideACL2]

For I in start to end loop      For[Set[I, start], Equal[I, end], Incr[I],
  Statements                      decideACL2, Statements]
End loop;                       (* Comment: B = Equal[I, end] *)



At this stage, the body of EntA contains only sequential statements: as- 
signments, conditionals or instantiations of components. Each one of them is 
represented by a function in Mathematica syntax. 

An assignment is modeled by NextSig for signals and ChangeVar for variables
(Table 2). NextSig assigns the next value of the signal while ChangeVar
assigns the variable directly. NextSig[Sig, terms] or ChangeVar[Var, terms]
also create rewrite rules [5] that transform Sig or Var to terms. These rules are
not applied during M-code generation, but during simulation.

Branching statements are modeled by functions whose semantics rely on a
three-valued logic (Table 3). When B is a symbolic formula that cannot
be evaluated to true or false by Mathematica, ACL2 is called to decide B under
constraints. Details about the decision procedure are discussed in the next
section.

4 Simulation Algorithm 

First, all objects are initialized with their values according to their VHDL dec- 
laration. The consistency of simulation constraints is verified by ACL2. After 
that, the M-code function is executed NbCYCLE times (NbCYCLE is user defined). 

At each simulation cycle, the function Test-vectors can be customized to 
generate specific inputs; for instance, reset signals can be active in the first 
simulation cycle, inactive otherwise. Then, the EntA function is interpreted in 
Mathematica, where two operations are performed: simplification of terms and 
branch decision. At the end of each cycle an execution tree is generated, which 
contains all symbolic values for each signal and variable in the design. 

4.1 Computation of Terms 

When assignment functions NextSig[Sig, terms] or ChangeVar[Var, terms] are
encountered, the right-hand side terms are simplified into terms', using standard
Mathematica and static VHDL rules. Then, the left-hand side Sig or Var is
assigned terms' and the rewrite rule Sig → terms' or Var → terms'
is added to a library called dynamic VHDL simplification rules. Those rules are
then available to simplify all subsequent assignments. This on-the-fly simplification
of terms is essential for time and memory efficiency.

Initialize (Sin, Sout, Slocal, Vlocal)
Verify-by-acl2 (Constraints)
For cycle := 1 to NbCYCLE do
    Test-vectors (Reset, cycle, Sin)
    EntA (Reset, Sin, Sout, Slocal, Vlocal)
    Print-Tree (Sin, Sout, Slocal, Vlocal)
End for;

Fig. 2. Simulation algorithm

Fig. 3. Branch decision scheme (interaction between Mathematica and ACL2)

In Table 2, ChangeVar[V, Plus[2, j]] assigns Plus[2, j] to V and creates
the rewrite rule V → 2 + j. In the next assignment, (V + 1) is simplified
using (V → 2 + j). Then, Q is assigned 3 + j. Finally, the rewrite rule
(Q → 3 + j) is created.
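
The two assignments above can be mimicked by a minimal sketch of dynamic rewrite rules. It is a Python stand-in for the Mathematica mechanism, restricted to linear integer terms; the dictionary representation, the CONST key and the function names are assumptions made for illustration only.

```python
CONST = "1"  # hypothetical key that carries the constant part of a term

def add(t1, t2):
    """Add two linear terms stored as {symbol: coefficient} dictionaries."""
    out = dict(t1)
    for sym, coeff in t2.items():
        out[sym] = out.get(sym, 0) + coeff
    return {sym: coeff for sym, coeff in out.items() if coeff != 0}

def simplify(term, rules):
    """Apply the rewrite rules symbol -> term, then collect like terms."""
    out = {}
    for sym, coeff in term.items():
        repl = rules.get(sym)
        if repl is None:
            part = {sym: coeff}
        else:
            part = {k: coeff * v for k, v in repl.items()}
        out = add(out, part)
    return out

rules = {}  # dynamic rewrite rules, filled while the assignments are executed

def change_var(var, term):
    """Simplify the right-hand side, assign it, and record the rule var -> term."""
    rules[var] = simplify(term, rules)
    return rules[var]

change_var("V", {CONST: 2, "j": 1})   # V := 2 + j  creates the rule V -> 2 + j
change_var("Q", {"V": 1, CONST: 1})   # Q := V + 1  is simplified with that rule
print(rules["Q"])                     # the term 3 + j, stored as a dictionary
```

Because each stored rule is already simplified, one substitution pass per assignment is enough in this sketch.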

4.2 Branch Decision 

During simulation, Mathematica, whenever it cannot decide a branch condition, 
calls ACL2. Figure 3 shows the principle of their interaction. 

First, Mathematica asks ACL2 to check the consistency of the set of simulation
constraints Lh. Function check_consistency takes Lh as input and returns a
minimal set of contradictory hypotheses Ih, or the empty set. If Ih is not empty,
the simulation is stopped and the contradiction is shown to the user.

If Ih is empty, Mathematica sends Lh ⇒ B to ACL2. If ACL2 finds a proof,
it returns Q.E.D; the "true" branch is considered for simulation. If ACL2 fails
or is not able to find a proof in a given time, it returns Failed. In this case,
Mathematica sends Lh ⇒ ¬B. If this proof succeeds, the "false" branch is
considered for simulation. Otherwise, the simulation stops and the user is asked
for more constraints. If more constraints are given, simulation is reinitialized.
Otherwise, the symbolic simulation forks into two branches, one assuming the
branch condition is true and the other its negation.

Branch decision is undecidable in general. However, most cases are limited
to equality and inequality formulae, and are resolved using pre-proved
theorems about them (written as ACL2 books). At each cycle, the proved theorems
are added to the ACL2 database and are available for future proofs.
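
The three possible outcomes of the decision scheme can be summarized by the following sketch. The function decide_branch only reproduces the control flow described above. The prover is passed in as a parameter, and toy_prove is a bounded check over a finite range that merely stands in for ACL2: it is not a proof procedure, and all names here are hypothetical.

```python
def decide_branch(constraints, cond, prove):
    """Return "true", "false" or "fork" for the branch condition cond.

    prove(hyps, goal) is expected to answer True only when it has
    established that the hypotheses imply the goal.
    """
    if prove(constraints, cond):
        return "true"                      # proof of Lh => B found
    if prove(constraints, lambda n: not cond(n)):
        return "false"                     # proof of Lh => not B found
    return "fork"                          # undecided: explore both branches

def toy_prove(hyps, goal, domain=range(-50, 51)):
    """Stand-in for ACL2: test the goal on a finite sample (NOT a real proof)."""
    models = [n for n in domain if all(h(n) for h in hyps)]
    return all(goal(n) for n in models)

constraints = [lambda n: n > 0]            # plays the role of Lh = {n in N*}
print(decide_branch(constraints, lambda n: 3 * n == n, toy_prove))
print(decide_branch(constraints, lambda n: n == n, toy_prove))
print(decide_branch(constraints, lambda n: n > 5, toy_prove))
```

The sketch omits the preliminary consistency check and the request for additional user constraints; with an inconsistent constraint set, toy_prove would succeed vacuously.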

Example: Euclid’s GCD algorithm (Table 4).

Table 4. Euclid’s GCD algorithm

VHDL                      M-code
----                      ------
P1: process begin         GCDmath[CLK_, RST_, a_, b_, OK_,
                                  res_, a0_, b0_, c0_] :=
  wait until clk='1';     Module[ ,
  if RST='1' then           If[RST==1,
    a0 := a;                  ChangeVar[a0, a];
    b0 := b;                  ChangeVar[b0, b];
    ok <= False;              NextSig[OK, False]
  elsif a0=b0 then          ,If[Equal[a0, b0]
    ok <= True;               ,NextSig[OK, True];
    res <= a0;                 NextSig[res, a0]
  elsif a0>b0 then            ,If[a0>b0
    a0 := a0-b0;                ,ChangeVar[a0, a0-b0]
  else b0 := b0-a0;             ,ChangeVar[b0, b0-a0]
  end if;                       ,decideACL2]
end process P1;               ,decideACL2]
                            ,decideACL2] ]



Before beginning the simulation, the function Test-vectors has been customized
to generate an active reset at the first simulation cycle and an inactive one
thereafter. The initial values are a = 3n and b = n and the constraints are
Lh = {n ∈ N*}. The simulation of four cycles runs as follows.

Fig. 4. Execution tree of the GCD example

At Cycle 1, RST has the numeric value 1 and a0 and b0 are assigned the initial
values 3n and n. In all subsequent cycles, RST is set to 0 and Mathematica
will always decide to simulate the "false" branch of the first if-then-else
statement. We do not mention it anymore. At Cycle 2, Mathematica cannot decide if
a0 is equal to b0, i.e. if 3n is equal to n. So, it calls decideACL2, which works
as shown in Figure 3. The constraint {n ∈ N*} is transformed into the ACL2
list ((integerp n) (< 0 n)) and its consistency is checked. As ACL2 returns an
empty list of contradictions Ih, Mathematica sends the following "defthm event"
to ACL2:

(defthm branch-1
  (implies (and (integerp n) (< 0 n))
           (equal (* 3 n) n)))

Because the ACL2 answer is "Failed", Mathematica sends the event:

(defthm branch-1-negation
  (implies (and (integerp n) (< 0 n))
           (not (equal (* 3 n) n))))

ACL2 answers "Q.E.D", Mathematica considers the "false" branch for simulation
and simplifies a0 - b0 to 2n. The reader may be surprised by the simplicity
of the theorems, but without ACL2 Mathematica is not able to prove them. At
Cycle 3, a0 is simplified to n and at Cycle 4 ACL2 answers "Q.E.D" to the event:

(defthm branch-4
  (implies (and (integerp n) (< 0 n))
           (equal n n)))

As four cycles have been simulated, the simulation is stopped. Figure 4 shows 
the execution tree without any constraints. With constraints, only the bold path 
is simulated (reset has been omitted). 
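
The constrained path of this example can be replayed with a short sketch. It relies on a simplifying assumption that is not part of the tool: every value is a multiple k·n of the symbol n, and since n > 0 the comparisons between k1·n and k2·n reduce to comparisons between the integer coefficients, which is exactly what the ACL2 theorems establish. The signal ok is updated immediately here, whereas in VHDL its new value becomes visible only after the process suspends. Function names are hypothetical.

```python
def gcd_cycle(state, rst, a, b):
    """One clock cycle of process P1; each value k stands for the term k*n, n > 0."""
    a0, b0, ok, res = state
    if rst:
        return (a, b, False, res)          # a0 := a; b0 := b; ok <= False
    if a0 == b0:
        return (a0, b0, True, a0)          # ok <= True; res <= a0
    if a0 > b0:
        return (a0 - b0, b0, ok, res)      # a0 := a0 - b0
    return (a0, b0 - a0, ok, res)          # b0 := b0 - a0

def simulate(a, b, cycles):
    """Run the process with an active reset in the first cycle only."""
    state = (None, None, None, None)
    trace = []
    for cycle in range(1, cycles + 1):
        state = gcd_cycle(state, rst=(cycle == 1), a=a, b=b)
        trace.append(state)
    return trace

trace = simulate(a=3, b=1, cycles=4)       # a = 3n, b = n
for cycle, (a0, b0, ok, res) in enumerate(trace, start=1):
    print(cycle, a0, b0, ok, res)
```

The coefficient of a0 goes from 3 to 2 to 1, matching the values 3n, 2n and n in the text, and the fourth cycle takes the equality branch.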
5 Discussion and Conclusions

Our prototype system for Constrained Symbolic Simulation takes advantage of
the best qualities of two powerful automatic systems: Mathematica to simplify
algebraic expressions, and ACL2 to decide the truth value of expressions under
a set of hypotheses. Clock-synchronized sequential circuits and delay-free
combinational circuits, written in a synthesizable VHDL subset, are automatically
translated into an M-code file, which serves as their simulation model.

The automatic generation of proof obligations for ACL2, in the form
of "defthm events", is implemented. Mathematica and ACL2 are executed as
concurrent processes, and communicate via a pipeline. Our technique efficiently
prunes the execution tree, and proves VHDL assert statements [1] on small
circuit blocks; we are working on bigger systems, like the AMBA architecture.
We intend to extend our method to more abstract specifications, as describable
in the next version of the VHDL subset for system-level synthesis, or in SystemC.
Abstract. We propose a debugging method for data-path intensive systems,
in particular, memory systems. The approach is based on strengthening
invariants by deriving constraints on data in the design using symbolic
simulation with constrained inputs. A new heuristic is introduced
for finding the appropriate input constraints for the symbolic simulation.
We give up soundness in order to gain more automation and efficiency,
minimizing or even eliminating the required manual effort. While it is
no longer possible to prove the correctness of the design, experimental
results demonstrate that the technique is quite effective in finding design
errors.



1 Introduction 

Most hardware systems of interest today are much larger than what can be 
reliably tested by conventional methods, and some form of formal verification 
becomes a necessity. In order for non-expert users to be able to apply formal 
methods, the tools must be mostly automatic. Some of the most successful ap- 
proaches to date are model checking [5], theorem proving [11], and validity check- 
ing [12]. However, these approaches are often applicable only to relatively small 
systems, or require significant manual guidance. 

In this paper, we are interested in verifying memory systems and similar data- 
intensive designs. Due to the large sizes of data structures used in memories, we 
model them as infinite systems. Proving the correctness of such designs usually 
boils down to proving an invariant. The approach we propose can be used in 
verifying arbitrary safety properties which can be expressed as invariants. 

The standard way to prove invariants for infinite systems is by induction over 
time. Most of the time, however, the invariant we want to prove is not inductive 
and has to be strengthened. Often, invariants are strengthened manually in a very 
tedious iterative process that requires experience and familiarity with the design. 
This is the most difficult and time consuming part of the verification process. 

* This research was supported by GSRC contract DABT63-96-C-0097-P00005, by Na- 
tional Science Foundation CCR-0121403, and by King Fahd University of Petroleum 
and Minerals, Saudi Arabia. The content of this paper does not necessarily reflect 
the position or the policy of GSRC, NSF, or the Government, and no official endor- 
sement should be inferred. 
Many techniques have been proposed in the literature to partially automate the
process of strengthening invariants [7,8,6,13,14,3,9,10,2,15,4].

In a previous work [1] we introduced a method for strengthening and prov- 
ing invariants by the technique called consistency testing which uses symbolic 
simulation. In that method, the user may have to supply the consistency test 
manually, and the tool then constructs the remaining part of the inductive in- 
variant, proves it, and verifies that the supplied test satisfies certain properties 
to guarantee soundness. 

In this work, we propose a similar method, but without the soundness check 
and with a simplified induction scheme. The consistency test is replaced by input 
constraints constructed automatically using a special heuristic. This results in 
a potentially unsound method, but it becomes completely automatic and serves 
as a very efficient debugging tool. Besides skipping the soundness check, the 
efficiency is also gained by reducing the number of cycles in symbolic simulation 
compared to the previous method. We use CVC [12] as a symbolic simulator and 
a validity checker in our experiments. 

Our approach is based on the empirical observation from several examples 
that most of the invariants in data path intensive systems can be obtained by 
symbolically simulating the system for a few cycles with specific inputs. The 
inductive step is then proven only for the states that can be reached by such 
symbolic simulation, instead of for all reachable states. In order to complete the
proof, we need to show that all the reachable states are included in this set of
states. However, we do not discuss this problem in the paper.

Instead, we give up this soundness check and propose our approach as a 
debugging tool. We tested the effectiveness of this approach by applying it to 
several examples of memory systems. In all the examples we considered, it was 
able to find all design errors in addition to several errors we inserted to test the 
effectiveness of our approach. This gives us confidence in the effectiveness and 
reliability of our approach as a debugging technique. 

The paper is organized as follows. Sections 2 and 3 formally introduce in- 
duction on time and functional equivalence, followed by a detailed description of 
our verification technique in section 4. An automatic technique for finding input 
constraints is given in section 5. Section 6 concludes the paper. 

2 Induction on Time 

We model a hardware design as a transition system $T = (S, s_0, N, R, D_{in}, D_{out})$,
where $S$ is a non-empty (and possibly infinite) set of states, $s_0 \in S$ is the initial
state, $D_{in}$ and $D_{out}$ are the domains of inputs and outputs, $N : S \times D_{in} \to S$
is the transition function, and $R : S \times D_{in} \to D_{out}$ is the output function. We
write $N(s, \sigma^\ell)$ to denote the final state of running $T$ on the input sequence
$\sigma^\ell = \sigma_0, \sigma_1, \ldots, \sigma_{\ell-1}$ of length $\ell$ starting from the state $s$:

$$N(s, \sigma^\ell) = N(N(\ldots N(s, \sigma_0), \sigma_1), \ldots, \sigma_{\ell-1}).$$
It is important to note that a single transition in T can actually represent a
complex transaction in the real hardware implementation requiring multiple cycles
of execution.

A state $s$ is called reachable in a transition system $T$ if there is an input
sequence $\sigma^\ell$ such that $s = N(s_0, \sigma^\ell)$, where $s_0$ is the initial state of $T$. In this
paper, we only consider safety properties, or invariants over the set of reachable
states. We say that a transition system $T$ satisfies a safety property $Q(s)$ if $Q(s)$
holds for every reachable state $s$ of $T$. This can be stated as follows:

$$\forall \ell, \sigma^\ell.\; Q(N(s_0, \sigma^\ell)). \qquad (1)$$

The conventional way of proving (1) is by induction on time: $Q$ is first
shown to hold in the initial state $s_0$, and then the transition function $N$ is shown
to preserve $Q$:

$$Q(s_0), \qquad \forall s, \sigma.\; Q(s) \Rightarrow Q(N(s, \sigma)). \qquad (2)$$

In practice, this induction scheme requires finding an inductive invariant,
which is often the hardest and most tedious part of the verification process.
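
The gap between (1) and (2) can be seen on a toy finite system. The sketch below is not taken from the paper: the eight-state counter, the property Q and its strengthening are hypothetical, chosen so that Q holds on every reachable state yet is not preserved from an unreachable state.

```python
INPUTS = (0, 1)
STATES = range(8)
S0 = 0

def N(s, x):
    """Toy transition function: add twice the input, modulo 8."""
    return (s + 2 * x) % 8

def reachable(s0, step, inputs):
    """Set of states reachable from s0, computed by exhaustive exploration."""
    seen, frontier = {s0}, [s0]
    while frontier:
        s = frontier.pop()
        for x in inputs:
            t = step(s, x)
            if t not in seen:
                seen.add(t)
                frontier.append(t)
    return seen

def inductive(Q, states, s0, step, inputs):
    """Check scheme (2): Q holds initially and is preserved by every transition."""
    return Q(s0) and all(Q(step(s, x)) for s in states if Q(s) for x in inputs)

def Q(s):
    return s != 3            # the safety property we care about

def Q_strong(s):
    return s % 2 == 0        # strengthened invariant, which implies Q

print(all(Q(s) for s in reachable(S0, N, INPUTS)))   # (1) holds
print(inductive(Q, STATES, S0, N, INPUTS))           # (2) fails for Q
print(inductive(Q_strong, STATES, S0, N, INPUTS))    # (2) holds once strengthened
```

The inductive step for Q breaks at the unreachable state 1, whose successor under input 1 is state 3; restricting attention to even states removes that spurious case.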



3 Functional Equivalence 

We prove correctness of systems using the idea of functional equivalence. The
problem is stated as follows. Given two systems, the concrete system $T^c$ (the
system we want to verify) and the abstract system $T^a$ (which defines the required
functionality of $T^c$), prove that $T^c$ is functionally equivalent to $T^a$. Two systems
are said to be functionally equivalent if they produce the same sequence of
outputs for the same sequence of inputs. Formally, this is expressed as follows:

$$\forall \ell, \sigma^\ell, \lambda.\; R^c(N^c(s_0^c, \sigma^\ell), \lambda) = R^a(N^a(s_0^a, \sigma^\ell), \lambda). \qquad (3)$$

If we define $Q(s)$ to be $\forall \lambda.\; R^a(s^a, \lambda) = R^c(s^c, \lambda)$, where $s = (s^a, s^c)$ pairs an
abstract state with a concrete state and $N$ applies $N^a$ and $N^c$ componentwise, (3)
becomes $\forall \ell, \sigma^\ell.\; Q(N(s_0, \sigma^\ell))$, which is the same as formula (1). So, we can use the same
induction principle given by (2) to prove the functional equivalence (3) of the
two modules.



4 The Verification Method 

In this section we introduce our approach through a simple example. We show 
how the direct use of (2) to prove the correctness of a memory system fails. Then 
we show how our method can be used to deal with the problem. 

Consider a small example of a read-only memory with a single-line cache 
given in figure 1(a). To verify the correctness of this design, we show that it 
is functionally equivalent to a simple (uncached) array of data in figure 1(b). 
Since the memories are read-only, the input to both modules is the address
($D_{in}$ = Addr), and the output is the data read from that address ($D_{out}$ = Data).
The transition systems $T^c$ and $T^a$ are defined as follows. The abstract state
of $T^a$ is just an array $M$ indexed by Addr and holding the Data elements.
The next state function $N^a$ is the identity function, and $R^a(s^a, \lambda) = M[\lambda]$. The
concrete state of $T^c$ contains the state of the cache in addition to the same
array $M$. Initially, in $s_0^c$, some arbitrary address is cached such that the cache is
coherent with the main memory $M$. The next state function $N^c(s^c, \lambda)$ adds the
address $\lambda$ and the data stored under that address, $M[\lambda]$, to the cache, yielding
the new state. The output function $R^c(s^c, \lambda)$ is similar to $N^c$, except that it
returns the data associated with the address $\lambda$.

Unfortunately, proving the functional equivalence of the two memories by
simple induction fails. Consider the state in figure 1 (a) and (b) where $a \neq b$ and
$a = c$. In this case, $s^c$ and $s^a$ are functionally equivalent and hence the induction
hypothesis $Q(s)$ is satisfied. However, transitioning to the next state by reading
some address $\sigma \neq \pi$ brings $T^c$ to a new state $s^{c\prime}$, shown in figure 1 (c), where
the address $\pi$ is no longer cached. Therefore, reading $\pi$ again yields $b \neq a$, which
no longer agrees with $T^a$. The induction fails in this case because it starts out
from an incoherent state, which is not reachable. The natural way to strengthen
the invariant is to require the state to be coherent. In this example it means that
the cached value must be the same as in the main memory. So, in general, we
can strengthen invariants for such systems by asserting their coherence.

Now suppose we simulate the incoherent state for one step with the input
constraint $C(s^c, \sigma) = \sigma \neq \pi$. The resulting state $s^{c\prime}$ is shown in figure 1(d).
Clearly, state $s^{c\prime}$ is coherent, and the induction (2) for such a state is valid.
Formally, (2) is restricted to the set of states $\Sigma'$ defined as follows:

$$\Sigma' = \{(s^a, s^{c\prime}) \mid \exists s^{c\prime\prime}, \sigma.\; s^{c\prime} = N^c(s^{c\prime\prime}, \sigma) \wedge \sigma \neq \pi\}.$$

The induction (2) with $\Sigma'$ becomes:

$$Q(s_0), \qquad \forall s \in \Sigma', \sigma.\; Q(s) \Rightarrow Q(N(s, \sigma)). \qquad (4)$$

Proving (4) does not complete the proof of correctness for the memory system;
it simply says that the concrete system behaves according to the specifications
when started from any state in $\Sigma'$. To complete the proof of correctness,
we need to prove that all reachable states are included in $\Sigma'$. That can be
done by proving the following induction:

$$\Sigma'(s_0), \qquad \forall s, \sigma.\; \Sigma'(s) \Rightarrow \Sigma'(N(s, \sigma)). \qquad (5)$$

In general, (5) is undecidable. For some memory systems, however, proving it
can be a matter of simple intuition on the part of the designer. For cases where we
fail to prove (5), our approach can still be used as an effective debugging tool. For the
cache example, it is easy to show that (5) is valid and that completes the proof
of correctness for this example.

The general idea in our approach is to find an input constraint $C(s, \sigma^\ell)$ on
an input vector $\sigma^\ell$ that, when executed on an arbitrary state $s$, will remove the
incoherences in it. For instance, in the example above, the read from $\sigma \neq \pi$
removes the incoherence by causing $c$ to be copied from the main memory to the
cache.
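
The cache example is small enough to be checked exhaustively. The following sketch is an illustration and not the CVC-based tool: it assumes two addresses and one-bit data, models a state as the tuple (abstract memory, concrete memory, cached address, cached data), and builds the restricted set of states by one simulation step whose input differs from the cached address. All names are hypothetical.

```python
from itertools import product

ADDRS = (0, 1)
DATA = (0, 1)

def R_a(M_a, lam):
    """Abstract output: plain array lookup."""
    return M_a[lam]

def R_c(M_c, ca, cd, lam):
    """Concrete output: cached data on a hit, main memory otherwise."""
    return cd if lam == ca else M_c[lam]

def N(s, sigma):
    """Read address sigma: the abstract state is unchanged, the cache is refilled."""
    M_a, M_c, ca, cd = s
    return (M_a, M_c, sigma, M_c[sigma])

def Q(s):
    """Both memories return the same data for every address."""
    M_a, M_c, ca, cd = s
    return all(R_a(M_a, lam) == R_c(M_c, ca, cd, lam) for lam in ADDRS)

MEMS = list(product(DATA, repeat=len(ADDRS)))
ALL_STATES = [(ma, mc, ca, cd)
              for ma in MEMS for mc in MEMS for ca in ADDRS for cd in DATA]

def step_holds(states):
    """Inductive step: Q is preserved by every transition from the given states."""
    return all(Q(N(s, x)) for s in states if Q(s) for x in ADDRS)

# States obtained by one step whose input differs from the cached address.
SIGMA_PRIME = {N(s, x) for s in ALL_STATES for x in ADDRS if x != s[2]}

print(step_holds(ALL_STATES))    # the step of (2) fails on arbitrary states
print(step_holds(SIGMA_PRIME))   # the step of (4) holds on the restricted set
```

Every state of SIGMA_PRIME has a cache line that agrees with the concrete main memory, which is the coherence property that the text uses to strengthen the invariant.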

5 Finding Input Constraints 

Data path intensive systems consist mainly of registers interconnected by buses
(or links). Each link has a condition or predicate associated with it. When the
condition is true, the data is transferred along the link. In any system transition
many data transfers may happen. These data transfers imply some constraints
on the state of the system. In the cache example, the data transfer from the main
memory to the cache implies the constraint that the cache and the main memory
are always coherent. In general, we can control which data transfers happen in
each system transition by constraining the inputs. In the cache example, we
constrained the input by $\sigma \neq \pi$.

Our heuristic looks for the right input constraints that will exercise the right
links and get the data synchronized. The idea is to look at counterexamples of
failed proofs. Suppose we try to prove

$$\forall s, \sigma.\; [Q(s) \Rightarrow Q(N(s, \sigma))]. \qquad (6)$$

If the proof fails, we get back a counterexample $C$. Intuitively, $C$ defines the data
transfers that contributed to the failure of the proof. Based on our assumption,
the proof failed due to incoherences between the data involved in these transfers.
If we simulate $s$ for one transition and exercise the same links as in $C$, we are
likely to get rid of these incoherences. Let $c_i$ be the condition associated with a
link $l_i$. If $l_i$ is activated in $C$, its condition $c_i$ becomes true in $C$. Let $\Sigma'$ be the
set of states where $c_i$ holds for each link $l_i$ activated in $C$. That is, the
input constraint becomes $C(s, \sigma) = \bigwedge_i c_i$. Then we try to prove:

$$\forall s' \in \Sigma', \sigma.\; [Q(s') \Rightarrow Q(N(s', \sigma))]. \qquad (7)$$

By simulating $s$ with the constraint $C(s, \sigma)$, it is likely that we will get rid of
the incoherences. If (7) is not valid, we get a new counterexample and repeat
the process. If at any point we get a counterexample with the same set of activated
links as in any previous counterexample, we report it as a potentially true
counterexample. The user can also put a limit on the number of iterations to
guarantee termination.
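
The iteration just described can be outlined as a loop. This is only a sketch of the control flow: the validity checker is abstracted as a callable try_proof, a counterexample is reduced to the set of links it activates, and the constraint is assumed to be rebuilt from the latest counterexample at each round. None of these names come from the tool.

```python
def find_constraints(try_proof, max_iterations=10):
    """Counterexample-guided search for input constraints.

    try_proof(links) attempts the inductive step under the conjunction of
    the conditions of `links`; it returns None when the proof succeeds, or
    the frozenset of links activated in the counterexample otherwise.
    """
    seen = []
    links = frozenset()                     # no constraint yet: attempt (6)
    for _ in range(max_iterations):
        cex_links = try_proof(links)
        if cex_links is None:
            return ("proved", links)
        if cex_links in seen:
            return ("potential_error", cex_links)
        seen.append(cex_links)
        links = cex_links                   # new constraint for attempt (7)
    return ("iteration_limit", links)

def toy_proof(links):
    """Hypothetical checker: the step is provable once the refill link is exercised."""
    if "mem_to_cache" in links:
        return None
    return frozenset({"mem_to_cache"})

print(find_constraints(toy_proof))
```

A counterexample that reappears with the same activated links is reported instead of being retried, which mirrors the stopping rule given above.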

6 Conclusion 

In this paper, we presented an automatic technique for finding design errors in
memories and data path systems. The method is based on a semi-formal version
of invariant checking using symbolic simulation with automatically generated
input constraints. We tested the method on various types of memory systems
(one- and two-level direct-mapped caches, a set-associative cache, and a memory
system with an SDRAM controller), and the method found all the bugs in these designs
without any manual effort, which demonstrates its effectiveness. The longest
runtime was for the two-level direct-mapped cache, which took 10 minutes on a
machine with an 800 MHz Pentium processor.



References 

1. Husam Abu-Haimed, Sergey Berezin, and David L. Dill. Strengthening invariants 
by symbolic consistency testing. In CAV’03, volume 2725 of LNCS, 2003. 

2. Saddek Bensalem, Yassine Lakhnech, and Hassen Saidi. Powerful techniques for 
the automatic generation of invariants. In CAV’96. 

3. Nikolaj Bjørner, Anca Browne, and Zohar Manna. Automatic generation of
invariants and intermediate assertions. In Theoretical Computer Science, 1997.

4. Jerry R. Burch and David L. Dill. Automatic verification of pipelined
microprocessor control. In CAV’94.

5. E. M. Clarke, E. A. Emerson, and A. P. Sistla. Automatic verification of finite- 
state concurrent systems using temporal logic specifications. ACM Transactions 
on Programming Languages and Systems, 8(2):244-263, 1986. 

6. Michael Colon and Tomas E. Uribe. Generating finite-state abstractions of reactive 
systems using decision procedures. In CAV’98. 

7. Satyaki Das and David L. Dill. Counter-example based predicate discovery in 
predicate abstraction. In FMCAD’02. 

8. S. Graf and H. Saidi. Construction of abstract state graphs with PVS. In CAV’97. 

9. Zohar Manna and Amir Pnueli. Temporal Verification of Reactive Systems: Safety.
Springer-Verlag, 1993.

10. John Rushby. Integrated formal verification: Using model checking with automated 
abstraction, invariant generation, and theorem proving. In SPIN’99 workshop. 

11. N. Shankar, S. Owre, and J. M. Rushby. PVS Tutorial. Computer Science Labo- 
ratory, SRI International, Menlo Park, CA, February 1993. 

12. A. Stump, C. Barrett, and D. Dill. CVC: a Cooperating Validity Checker. In 
CAV’02. 

13. Jeffrey X. Su, David L. Dill, and Clark W. Barrett. Automatic generation of 
invariants in processor verification. In FMCAD’96. 

14. Jeffrey X. Su, David L. Dill, and Jens U. Skakkebaek. Formally verifying data and 
control with weak reachability invariants. In FMCAD’98. 

15. A. Tiwari, H. Rueß, H. Saidi, and N. Shankar. A technique for invariant generation.
In TACAS’01.




CTL May Be Ambiguous When Model Checking 

Moore Machines 



Cedric Roux and Emmanuelle Encrenaz 
UPMC - LIP6 - ASIM 

12, rue Cuvier, 75252 Paris CEDEX 5 - France 
{Cedric.Roux, Emmanuelle.Encrenaz}@lip6.fr



Abstract. The model checking problem is defined over Kripke structures.
However, hardware designers often handle other models, such as
Moore machines. When model checking their designs using CTL as a
logic, they must translate them into Kripke structures. A given CTL
property may be believed to be true (conversely false) over the Moore
machine and in fact be false (conversely true) on the derived Kripke
structure. This may lead to ambiguities if the designer does not fully
understand the translation scheme he uses, which may be the case if he
uses automatic tools. We present iCTL, a logic specifically designed to
work with Moore machines, which extends CTL to help the designer
remove possible ambiguities when model checking Moore machines. We
show that it is strictly more expressive than CTL.



1 Introduction 

While developing a symbolic model checker to verify hardware systems described 
as a composition of synchronous Moore machines, we came across an interesting 
problem. We use CTL [2] as logic and the formulae we want to verify may include 
values of input signals of the Moore machines. These input signals do label the 
transitions of the Moore machine. Since CTL is defined over Kripke structures 
and not Moore machines, and because the transitions of Kripke structures are 
not labelled, when translating a Moore machine into a Kripke structure, one 
has to integrate the input signals in the states of the Kripke structure. Several 
choices are possible. Depending on the translation chosen, the truth value of a 
given property may either be true or false over the derived Kripke structure. 
This introduces an ambiguity that the designer must be aware of when verifying
his designs. He has to know how his model is translated into the one used
by the model checker, and has to write properties with this in mind, so as not
to be confused by the answer of the tool. Not doing so could even lead to a
counter-intuitive situation, where the designer might view his model as being
buggy when in fact he simply wrote wrong formulae, reasoning over Moore
machines and not over the derived Kripke structures.

In [3] the authors translate a Moore machine into a Kripke structure by
incorporating the input configurations in the source state of the transitions,
and they define the truth value of a CTL property over a Moore machine as
being the truth value of this property over the Kripke structure. We think that
such an approach leads to ambiguities.

In SMV [5], one directly writes Kripke structures and CTL formulae over 
these structures. It is possible to create free variables (that may represent input 
signals of Moore machines incorporated into the current state of the Kripke 
structure). This leads to exactly the same situation as in [3].

The VIS model checker [1] accepts, among others, systems described in a 
Verilog subset, in which collections of Moore machines can be represented. It 
supports modularity and the concept of input and output signals is present. 
However, an input signal can appear in a CTL formula only if it is declared 
of type reg, which forces its assignment in guarded blocks. As a consequence, 
depending on the way this assignment is done, input signals of a Moore machine 
will be included into the source or target state of the transitions in the Kripke 
structure, which influences the results of the verification of a given formula. 

The purpose of this article is to propose adding two new operators to CTL,
to reconcile the intuitive idea one can have regarding the truth value of a
formula over the Moore machine with the one obtained by the verification
algorithm over the Kripke structure. These two operators are specifically designed to
handle Kripke structures derived from Moore machines and are meaningless in
other cases. We hope that their use will facilitate the writing of formulae
and the understanding of the results produced by the model checker.

2 Translating a Moore Machine into a Kripke Structure 

Several translation schemes from a Moore machine into a Kripke structure are 
possible. The simplest one is to remove the inputs labelling transitions. Since we 
want to express properties including input signals, we abandoned such a scheme. 
Another way is to put the input signals into the target state of a transition. Since
we plan to compose Moore machines, this solution cannot be retained, because the
outputs of one machine that are inputs of another must have the same
temporal behavior as the other inputs of the second machine. So, we have to put
the inputs into the source state of a transition.

Here follow the formal definitions of Kripke structures, Moore machines,
and the translation scheme we adopted.

Definition 1. A Kripke structure is a five-tuple $(S, S_0, P, L, R)$ where

1. $S$ is a finite set of states,

2. $S_0 \subseteq S$ is the set of initial states,

3. $P$ is a finite set of atomic propositions (we define $n_P = |P|$),

4. $L = \{l_0, \ldots, l_{n_P-1}\}$ is a vector of $n_P$ functions, each function defining the
value of exactly one atomic proposition; for all $0 \le i \le n_P - 1$ we have
$l_i : S \to \mathbb{B}$; for all $s \in S$, we have that $l_i(s)$ is true iff the atomic proposition
associated to $l_i$ is true in $s$,

5. $R \subseteq S \times S$ is the transition relation.

Definition 2. A Moore machine is a structure $(S, S_0, I, O, L, R)$ where

1. $S$ is a finite set of states,

2. $S_0 \subseteq S$ is the set of initial states,

3. $I$ is the finite set of input symbols,

4. $O$ is the finite set of output symbols (we define $n_O = |O|$),

5. $L = \{l_0, \ldots, l_{n_O-1}\}$ is a vector of $n_O$ functions, each function defining the
value of exactly one output symbol; for all $0 \le i \le n_O - 1$ we have $l_i : S \to \mathbb{B}$;
for all $s \in S$, we have that $l_i(s)$ is true iff the output symbol associated
to $l_i$ is true in the state $s$,

6. $R \subseteq S \times 2^I \times S$ is the transition relation.

The Moore machines we handle are complete and deterministic. Complete 
means that each state has one successor for any input configuration. Determin- 
istic means that for a given input configuration, a state s will always lead to the 
same state s'. 

Definition 3. Translating a Moore Machine by Putting the Inputs in
the Source State. Given a Moore machine $(S_M, S_{M_0}, I_M, O_M, L_M, R_M)$, we
deduce the Kripke structure $(S_K, S_{K_0}, P_K, L_K, R_K)$ where:

- $S_K = S_M \times 2^{I_M}$,

- $S_{K_0} = S_{M_0} \times 2^{I_M}$,

- $P_K = I_M \cup O_M$ (we define $n_{I_M} = |I_M|$ and $n_{O_M} = |O_M|$),

- $L_K = \{l_{O_0}, \ldots, l_{O_{n_{O_M}-1}}\} \cup \{l_{I_0}, \ldots, l_{I_{n_{I_M}-1}}\}$; for all $0 \le i \le n_{O_M} - 1$
we have $l_{O_i} : S_K \to \mathbb{B}$; for all $i$, for all $s = (s_i, c_i) \in S_K$, we have that
$l_{O_i}(s)$ is true iff $l_{M_i}(s_i)$ is true; for all $0 \le i \le n_{I_M} - 1$, we have
$l_{I_i} : S_K \to \mathbb{B}$ (each $l_{I_i}$ is associated to one and only one input signal); for all $i$,
for all $s = (s_i, c_i) \in S_K$, we have that $l_{I_i}(s)$ is true iff the component of $c_i$
corresponding to the input signal associated to $l_{I_i}$ is true,

- $R_K \subseteq S_K \times S_K$ and $\forall (s, c_i) \in S_K$, $\forall (s', c') \in S_K$, we have
$((s, c_i), (s', c')) \in R_K$ iff $(s, c_i, s') \in R_M$.

An example of a trivial Moore machine and the derived Kripke structure is 
shown in figure 1. 
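The translation of definition 3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the dictionary encoding of the labelling and of the (complete, deterministic) transition function, and the two-state machine used in the check are all hypothetical and do not come from the paper or from the authors' tool.

```python
from itertools import product

def input_configs(inputs):
    """All subsets of the input symbols, i.e. the elements of 2^I."""
    inputs = sorted(inputs)
    return [frozenset(x for x, bit in zip(inputs, bits) if bit)
            for bits in product([False, True], repeat=len(inputs))]

def moore_to_kripke(states, init, inputs, out_label, delta):
    """Definition 3: Kripke states are pairs (Moore state, input configuration).

    out_label[s] is the set of output symbols true in s;
    delta[(s, c)] is the unique successor of s under configuration c
    (the machine is assumed complete and deterministic).
    """
    configs = input_configs(inputs)
    k_states = [(s, c) for s in states for c in configs]
    k_init = [(s, c) for s in init for c in configs]
    # A proposition holds in (s, c) if it is an output true in s
    # or an input that is set in c.
    k_label = {(s, c): frozenset(out_label[s]) | c for (s, c) in k_states}
    # ((s, c), (s', c')) is a transition iff (s, c, s') is a Moore transition;
    # the configuration c' of the target state is unconstrained.
    k_trans = {(s, c): [(delta[(s, c)], c2) for c2 in configs]
               for (s, c) in k_states}
    return k_states, k_init, k_label, k_trans
```

With one input signal, every Moore state is split into two Kripke states, and each Kripke state has exactly two successors that differ only in the input configuration they carry.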

3 A Disturbing Example 

We could simply state that a CTL formula is true in a Moore machine if and
only if it is true in the corresponding Kripke structure, as done in [3], but the
verification results obtained may disturb the designer.

As an illustration, we propose to check the CTL property $(\mathrm{EX}\, p) \land (\mathrm{EX}\, \lnot p)$
over the Moore machine depicted in figure 1.

This formula would be true on a Kripke structure obtained from the Moore
machine by removing the inputs, but it is false on the Kripke structure shown in
figure 1 (which is the one obtained with the translation of definition 3), because
neither $A_0$ nor $A_1$ has both a successor verifying $\lnot p$ and a successor verifying $p$.

In fact, the formula $(\mathrm{EX}\, p) \land (\mathrm{EX}\, \lnot p)$ is ambiguous over the Moore machine:
do we mean that both successors are selected by the same input configuration
or by different input configurations?
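The effect can be reproduced on a small machine. The sketch below does not use the machine of figure 1 (which is not reproduced here); it assumes a hypothetical two-state machine with a single input i, and all names in it are illustrative.

```python
def ex(trans, sat):
    """States with at least one successor in `sat` (the CTL operator EX)."""
    return {s for s, succs in trans.items() if any(t in sat for t in succs)}

# Hypothetical machine: from either state, i = 1 leads to B (where p holds)
# and i = 0 leads to A (where p does not hold).
moore = {("A", 0): "A", ("A", 1): "B", ("B", 0): "A", ("B", 1): "B"}
p_states = {"B"}

# Translation 1: remove the inputs labelling the transitions.
no_input = {s: {moore[(s, 0)], moore[(s, 1)]} for s in ("A", "B")}
p1 = set(p_states)
not_p1 = set(no_input) - p1
holds_without_inputs = ex(no_input, p1) & ex(no_input, not_p1)

# Translation 2 (definition 3): put the inputs into the source state.
with_input = {(s, c): {(moore[(s, c)], c2) for c2 in (0, 1)}
              for (s, c) in moore}
p2 = {(s, c) for (s, c) in with_input if s in p_states}
not_p2 = set(with_input) - p2
holds_with_inputs = ex(with_input, p2) & ex(with_input, not_p2)
```

On this machine the formula holds in every state of the input-free structure but in no state of the structure of definition 3, because a Kripke state that has already fixed its input configuration reaches only copies of a single Moore state.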

Fig. 1. A trivial Moore machine and its derived Kripke structure

Fig. 2. A Moore machine illustrating the use of $\exists_I$ and $\forall_I$



4 iCTL - CTL Model Checking with Input Configurations

We introduce two new operators to CTL. These two operators are $\forall_I$ and $\exists_I$.
This defines a new logic, which we call iCTL. Given $\phi$, an iCTL formula (that
may contain $\forall_I$ and $\exists_I$ operators), $\forall_I \phi$ stands for “for all input configurations, $\phi$
holds” and $\exists_I \phi$ stands for “there is an input configuration for which $\phi$ holds”.
Here follows the formal definition of iCTL.



4.1 Syntax and Semantics of iCTL 

The syntax is the same as that of CTL, with the following added rule for
state formulae.

- if $f$ is a state formula, then $\forall_I f$ and $\exists_I f$ are state formulae.

The semantics remains the same, with the following added rules.

As the two new operators deal with input configurations, the Kripke structures
they apply to are the ones given by our translation from Moore machines. The
symbols are thus the same as those of definition 3.

$M, s \models \forall_I f \iff s = (s_M, c_M)$ and for all $c'_M \in 2^{I_M}$, with $s' = (s_M, c'_M)$, we
have that $s' \models f$,

$M, s \models \exists_I f \iff s = (s_M, c_M)$ and for some $c'_M \in 2^{I_M}$, with $s' = (s_M, c'_M)$, we
have that $s' \models f$.

Since our Moore machines are complete, for all input configurations, the state
$s'$ exists in the Kripke structure, thus $s' \models f$ is sound.
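On an explicit-state representation, the two operators reduce to grouping Kripke states by their Moore component. The following is a minimal sketch, assuming that Kripke states are encoded as pairs (Moore state, configuration) and that a formula is represented by the set of states satisfying it; the function names are hypothetical, and a symbolic model checker would compute the same sets by quantifying over the input variables instead.

```python
def forall_i(states, sat):
    """States (s, c) such that (s, c') is in `sat` for every configuration c'."""
    by_moore = {}
    for (s, c) in states:
        by_moore.setdefault(s, []).append((s, c))
    return {st for group in by_moore.values()
            if all(x in sat for x in group) for st in group}

def exists_i(states, sat):
    """States (s, c) such that (s, c') is in `sat` for some configuration c'."""
    good = {s for (s, c) in sat}
    return {(s, c) for (s, c) in states if s in good}
```

Both functions rely on the completeness assumption: every pair (s, c') must be present in `states`, which definition 3 guarantees.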

Using iCTL, we now can define when a Moore machine validates a logical 
formula. 

Definition 4. A Moore machine M validates a formula f of iCTL if and only
if the formula is true in the corresponding Kripke structure, as given by the 
transformation of definition 3. 

[Figure 3: a Moore machine (left) and its corresponding Kripke structure (right), annotated with the formula $\exists_I$ EX p.]

Fig. 3. An example showing the better expressiveness of iCTL



This definition is the same as in [3], but we expect the designer to remove
the ambiguities of CTL by using $\exists_I$ and $\forall_I$ in the places where they are needed.

4.2 Examples 

The Moore machine of figure 2 will be used as example. 

On the Kripke structure derived from it by the translation of definition 3,
we find that the formula AX EX p is false in $s_{1.1}$ and $s_{1.2}$. Looking at the
Moore machine, one might think that this formula is true in $s_1$, since all its
successors have a successor where p is true (states $s_4$ and sq). The formula
AX ($\exists_I$ (EX p)) is true in $s_{1.1}$ and $s_{1.2}$ on the derived Kripke structure. This
corresponds to the intuition one might have about the truth value of AX EX p
over the Moore machine. We see here that to capture this intuition, $\exists_I$ is
necessary.

Similarly, the formula EX AX p is true in $s_{1.1}$ and $s_{1.2}$ in the derived Kripke
structure while EX ($\forall_I$ (AX p)) is false in $s_{1.1}$ and $s_{1.2}$. This latter interpretation
seems to be consistent with the intuition that one might have for the truth value
of EX AX p in the state $s_1$ of the Moore machine.
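The four verdicts above can be replayed on an explicit-state model. Since figure 2 is not reproduced here, the sketch assumes a hypothetical tree-shaped machine with one input signal (so two input configurations, 0 and 1); the state names, the `delta` table and the helper functions are illustrative only.

```python
CONFIGS = (0, 1)  # one input signal i, hence two input configurations

# Hypothetical machine: delta[state][i] is the successor under input i.
delta = {
    "s1": ("s2", "s3"),
    "s2": ("s4", "s5"),
    "s3": ("s6", "s7"),
    "s4": ("s4", "s4"), "s5": ("s5", "s5"),
    "s6": ("s6", "s6"), "s7": ("s7", "s7"),
}
p_moore = {"s4", "s6"}  # Moore states in which the output p is true

# Definition 3: the inputs go into the source state of each transition.
states = {(s, c) for s in delta for c in CONFIGS}
succ = {(s, c): {(delta[s][c], c2) for c2 in CONFIGS} for (s, c) in states}
p = {(s, c) for (s, c) in states if s in p_moore}

def ex(sat):
    return {x for x in states if succ[x] & sat}

def ax(sat):
    return {x for x in states if succ[x] <= sat}

def exists_i(sat):
    good = {s for (s, _) in sat}
    return {(s, c) for (s, c) in states if s in good}

def forall_i(sat):
    return {(s, c) for (s, c) in states
            if all((s, c2) in sat for c2 in CONFIGS)}

s1 = {("s1", 0), ("s1", 1)}          # the two Kripke copies of s1
ax_ex = s1 & ax(ex(p))               # AX EX p
ax_ei_ex = s1 & ax(exists_i(ex(p)))  # AX (E_I (EX p))
ex_ax = s1 & ex(ax(p))               # EX AX p
ex_ai_ax = s1 & ex(forall_i(ax(p)))  # EX (A_I (AX p))
```

On this assumed machine, AX EX p holds in neither copy of s1 while AX ($\exists_I$ (EX p)) holds in both, and EX AX p holds in both copies while EX ($\forall_I$ (AX p)) holds in neither, which is the same pattern as described in the text.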

4.3 iCTL Is More Expressive than CTL 

Given a formula $f \in$ iCTL and a formula $g \in$ CTL, we say that $f$ is equivalent
to $g$ if and only if for every Kripke structure $K$ derived from a Moore machine $M$
using the translation of definition 3, and for every state $s$ of $K$, we have that $K, s \models f$
iff $K, s \models g$. (This is the global equivalence of [4].)

On the Kripke structure of figure 3, we can prove (by induction over its size)
that no CTL formula has its truth value changed in $s_{1.1}$, $s_{2.1}$ and $s_{2.2}$ if
we change the labelling of $s_3$. But the iCTL formula “$\exists_I$ EX p” distinguishes
both cases. Since all CTL formulae are in iCTL, we have that iCTL is more
expressive than CTL (for Kripke structures coming from definition 3).

4.4 iCTL and Other Logics 

Modal $\mu$-calculus is a logic dealing with labelled transition systems (thus, able
to handle Moore machines), which contains the $\langle * \rangle$ and $[*]$ operators. $\langle * \rangle p$ is
true in a state $s$ if $p$ is true in at least one of its successors, reachable by any
transition. $[*] p$ is true in a state $s$ if $p$ is true in all the successors of $s$, reachable
by any transition. We think that $\langle * \rangle$ in the $\mu$-calculus is equivalent to $\exists_I$ EX
in iCTL and that $[*]$ is equivalent to $\forall_I$ AX. Formulae $\exists_I$ AX p or $\forall_I$ EX p
are in iCTL and have a meaning over Kripke structures obtained from Mealy
machines. We did not find formulae equivalent to those in the $\mu$-calculus.

LTL does not present the same ambiguities as CTL since it only captures
a set of infinite sequences, and the sets of sequences of the Moore machine and
of the derived Kripke structure are equivalent. So, something like “iLTL” would
be useless.

5 Conclusion 

The paper discusses the consequences of placing the input configurations that label
transitions in Moore machines into the source states of the derived Kripke
structure built to perform CTL model checking. This translation has an impact on
the verification, since a given CTL formula believed to be true or false on the
Moore machine can have a different truth value on the obtained Kripke
structure. This is due to the lack of expressiveness of CTL, which does not take into
account labelled transitions such as those found in Moore machines. To overcome this
ambiguity, we introduce two operators, $\exists_I$ and $\forall_I$. We show that the obtained
logic, named iCTL, is more expressive than CTL. We have implemented these
operators in our model checker and it is our intention to verify complex systems
with this logic.


Abstract. Generalized symbolic trajectory evaluation (GSTE) is a new model-checking
approach that combines the industrially-proven scalability and capacity
of classical symbolic trajectory evaluation with the expressive power of temporal- 
logic model checking. GSTE was originally developed at Intel and has been used 
successfully on Intel’s next-generation microprocessors. However, the supporting 
theory and algorithms for GSTE are still immature. In particular, GSTE specifications
are given as assertion graphs, a variety of $\forall$-automata, and although an
efficient model-checking algorithm exists to verify whether a circuit model obeys 
a specification assertion graph, there is no work on reasoning about assertion 
graphs themselves. This paper presents new algorithms to leverage GSTE model 
checking to efficiently decide whether one assertion graph implies another, and to 
model check one assertion graph under the assumption that another is true (under 
regular GSTE acceptance conditions). These two operations — deciding whether 
one specification implies another and verifying under an assumption — are the 
fundamental building blocks of compositional verification and any higher-level 
reasoning about model-checking results, so the algorithms presented here are key 
steps to using GSTE in a broader verification framework. Preliminary experimen- 
tal results applying our algorithms to real, industrial circuits and specifications 
show that our algorithms are useful in practice. 



1 Introduction 

Generalized symbolic trajectory evaluation (GSTE) is a powerful, new model-checking
approach [20]. GSTE is based on classical symbolic trajectory evaluation [16], which has
proven itself able to handle large, industrial designs and has been in active use at Compaq 
(now HP), IBM, Intel, and Motorola (e.g., [12,10,1,4]). Classical symbolic trajectory 
evaluation, although efficient, is very limited in the types of properties that it can specify 
and verify. GSTE extends classical symbolic trajectory evaluation to handle $\omega$-regular
properties, giving it comparable expressive power to more established model-checking 
approaches [5,13,18,8,6], while still maintaining the efficiency and capacity of classical 
symbolic trajectory evaluation. GSTE was originally developed at Intel and has been 
used successfully on Intel’s next-generation microprocessors (e.g., [3]). 

Key to the efficiency and usability of GSTE is the manner in which properties are 
specified, in a variety of automata called an assertion graph. Existing GSTE theory 
provides an efficient procedure for model checking that a circuit obeys an assertion graph, 
as well as techniques based on abstract interpretation to combat state explosion [21].
What is missing, however, is all the supporting theory and algorithms that have developed 
around more established formalisms like CTL [5] or LTL [18]. In particular, there has 
been no published research on how to reason about assertion graphs. 

This paper presents the foundational pieces for reasoning about specifications given 
as assertion graphs. Specifically, we give new algorithms to decide whether one asser- 
tion graph implies another, and to model check one assertion graph under the assump- 
tion that another is true. These two operations — deciding whether one specification 
implies another and verifying under an assumption — are the fundamental building 
blocks for decomposing a verification task, composing verification results, and any 
other higher-level reasoning about specifications. Our current verification system is a 
mixed deductive-algorithmic system, with an efficient GSTE model-checking procedure 
built into a lightweight theorem prover. Our new algorithms exploit the existing GSTE
model-checking procedure, creating an efficient, algorithmic means to discharge ba- 
sic deductive reasoning steps about assertion graphs. Preliminary experimental results 
on real, industrial circuits and specifications show that the algorithms are efficient in 
practice. 



2 Background 

2.1 GSTE and Assertion Graphs 

GSTE is explained in several sources (e.g., [20,21,19], etc.). Here, we concentrate on 
the specification style used by GSTE and highlight its characteristics. 

GSTE is basically a linear-time model-checking method, i.e., the possible behaviors
of the system being verified are considered to be the set of all possible execution traces,
and verification consists of checking that all of these traces obey the specification. The
specification in GSTE is called an assertion graph, and is basically a variety of automaton.
One can think of the assertion graph as defining the set of execution traces that it
accepts, so the verification problem is basically language containment. Figure 1 gives a
simple example and intuitive explanation of an assertion graph.

In general, an assertion graph is a directed graph with distinguished initial vertex
$v_0$, and the restriction that all vertices must have non-zero out-degree. Each edge $e$
is labeled with an antecedent $ant(e)$ and a consequent $cons(e)$. The antecedents and
consequents are simply propositional formulas over some set of atomic propositions 
AP. Traditionally, the atomic propositions correspond exactly to the state variables of 
the system being verified, so the antecedents and consequents are formulas over the state 
of the system at some point in time. The assertion graph also has acceptance conditions, 
described below. 

A path in the assertion graph is a directed path (defined in the usual manner for 
directed graphs) starting from the initial vertex $v_0$. Every path in the assertion graph
specifies a temporal if-then assertion: if the antecedents hold, then the consequents must 
hold as well. More precisely, a path of length n (i.e., with n edges) is an assertion about 
the system’s behavior over a period of n clock cycles. If all of the antecedents along the 
path hold at the corresponding points in the system’s behavior, then all of the consequents 

[Figure 1: assertion graph. Edge labels (antecedent / consequent): True / True; WRITE / True; NO_OVERWRITE / True; READ_SEL_ALIGN / True; MASK / DATA_CORRECT.]

WRITE: $(we = 1) \land (addr = A) \land (datawr = D)$
NO_OVERWRITE: $(we = 0) \lor (addr \ne A)$
READ_SEL_ALIGN: $(ck = 0) \land (we = 0) \land (addr = A) \land (sel = S) \land (align = R)$
MASK: $(ck = 1) \land (maskbegin = B) \land (maskend = E)$
DATA_CORRECT: $(dataout = mask(align(select(D, S), R), B, E))$




Fig. 1. GSTE Assertion Graph Example. This assertion graph, adapted from [20], was used in 
the verification of an industrial memory design, which reads and writes data with a large variety 
of selection and alignment options. The property being verified is that, if data value D is written 
to address A, followed by an arbitrary number of clock cycles that don’t overwrite the same 
address, followed by a read of the address, then the value returned is the value that was written, 
appropriately aligned and masked. The edge labels are of the form “antecedent / consequent”, 
where the antecedents and consequents are simply propositional formulas over the state of the 
system at a given clock cycle. For example, the antecedent WRITE specifies that the value of 
the write-enable input we is high, that the address input addr is equal to some value A, etc. 
The capital letters denoting values, like A, D, etc., are symbolic constants, which are essentially
Skolem constants that can be equal to any value, making the verification result hold for all possible
values of the symbolic constants. A path is a sequence of edges that starts from the initial vertex
$v_0$. A terminal path is a path that ends with a terminal edge (shown in the figure by a tic-mark on
the edge, e.g., the edge from $v_2$ to $v_3$). A path accepts an execution trace if at least one antecedent
on that path fails (is false on the state of the system at that clock cycle) or if all antecedents and 
all consequents on the path succeed (are true on that clock cycle). Intuitively, a path is an if-then 
assertion: the antecedents say when the assertion is relevant; the consequents say what must hold 
whenever the assertion is relevant. If any antecedent fails, the assertion is vacuously true; if all 
antecedents are satisfied, then all consequents must be satisfied as well. The assertion graph as 
a whole accepts an execution trace if every terminal path in the assertion graph accepts that 
trace. Intuitively, the assertion graph takes a potentially infinite set of assertions about the system 
and rolls them up into a graph; therefore, every trace must satisfy every assertion (vacuously or 
otherwise). 



[Figure 2: block diagram of the monitor circuit, with inputs corresponding to the atomic propositions in G, an init input, and an accept output.]

Fig. 2. Monitor Circuit. Our algorithms rely on a linear-space, linear-time construction for a
monitor circuit from an assertion graph G. The generated circuit has inputs corresponding to the 
atomic propositions in G and an output that is true iff the sequence of states presented at the input 
would have been accepted by G. The init input initializes the internal state of the circuit. 
must also hold at the corresponding points, in order for the assertion to be satisfied. If
any antecedent doesn’t hold, then the assertion is vacuously true. Formally, if $p$ is a path
of length $n$, with $p[i]$ denoting the $i$th edge in $p$, and if $\sigma$ is a trace consisting of $n$ system
states, with $\sigma[i]$ (written $\sigma_i$ below) denoting the $i$th state, then $\sigma$ satisfies or is accepted by $p$ iff

$$(\forall i,\ 1 \le i \le n.\ \sigma_i \models ant(p[i])) \Rightarrow (\forall i,\ 1 \le i \le n.\ \sigma_i \models cons(p[i])).$$

For convenience, we will say that “$\sigma$ satisfies the antecedents of $p$” if $\forall i,\ 1 \le i \le n.\ \sigma_i \models ant(p[i])$,
and that “$\sigma$ fails at least one of the consequents of $p$” if $\exists i,\ 1 \le i \le n.\ \sigma_i \not\models cons(p[i])$.
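The acceptance condition for a single path translates directly into code. The sketch below assumes an explicit (non-symbolic) encoding in which each edge is a pair of Python predicates over a system state; the function name and this encoding are illustrative and are not part of GSTE itself.

```python
def path_accepts(path, trace):
    """Return True iff `trace` satisfies `path`.

    `path` is a list of (antecedent, consequent) pairs, one per edge, and
    `trace` is a list of system states of the same length; antecedents and
    consequents are predicates on a single state.
    """
    assert len(path) == len(trace)
    all_ants = all(ant(state) for (ant, _), state in zip(path, trace))
    all_cons = all(cons(state) for (_, cons), state in zip(path, trace))
    # Vacuously true if some antecedent fails; otherwise all consequents must hold.
    return (not all_ants) or all_cons
```

For instance, a two-edge path "req / True" followed by "True / ack" accepts any trace in which req is low in the first state, and otherwise requires ack in the second state.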

An assertion graph as a whole accepts a given trace iff all “appropriate” paths in 
the assertion graph are satisfied. Appropriate is defined by the four different kinds of 
acceptance in GSTE: 

- In strong satisfiability, a finite-length trace is accepted iff it satisfies all paths of the
same length in the assertion graph.

- In terminal satisfiability, some edges are marked as terminal edges, and a terminal
path is a path that starts from $v_0$ and ends with a terminal edge. A finite-length trace
is accepted iff it satisfies all terminal paths of the same length.

- In normal satisfiability, an infinite trace is accepted iff it satisfies all infinite paths.

- In fair satisfiability, there is a finite set of fair edge sets. A path is fair iff it visits
each fair edge set infinitely often (generalized Büchi fairness). An infinite trace is
accepted iff it satisfies all fair paths.

The different kinds of acceptance are listed in (roughly) increasing order of model- 
checking complexity. 

An assertion graph $G$ defines the set of traces that it accepts. Call that set the language
of $G$, denoted $L(G)$. Similarly, a system $M$ defines the set of traces that it can produce,
denoted $L(M)$. Verification consists of proving that $L(M) \subseteq L(G)$. In subsequent
sections of this paper, unless otherwise stated, we will restrict ourselves to terminal
satisfiability, which includes strong satisfiability as a special case, because the finite-trace
satisfiabilities are currently the most commonly used in practice.
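Terminal satisfiability for a single concrete trace can be checked by brute force, which makes the definition easy to experiment with. The sketch below is an assumption-laden illustration, not the GSTE algorithm: it enumerates paths explicitly (exponential in the worst case), and the edge encoding `(src, dst, ant, cons, is_terminal)` and the function names are hypothetical.

```python
def terminal_paths(edges, v0, n):
    """Yield every path of n edges that starts at v0 and ends on a terminal edge.

    Each edge is a tuple (src, dst, ant, cons, is_terminal).
    """
    def extend(vertex, prefix):
        if len(prefix) == n:
            if prefix[-1][4]:
                yield list(prefix)
            return
        for e in edges:
            if e[0] == vertex:
                yield from extend(e[1], prefix + [e])
    if n > 0:
        yield from extend(v0, [])

def graph_accepts(edges, v0, trace):
    """Terminal satisfiability: every terminal path of len(trace) must accept."""
    for path in terminal_paths(edges, v0, len(trace)):
        ants = all(e[2](s) for e, s in zip(path, trace))
        cons = all(e[3](s) for e, s in zip(path, trace))
        if ants and not cons:
            return False  # all antecedents hold but some consequent fails
    return True
```

Strong satisfiability is the special case in which every edge is marked terminal.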

At first glance, assertion graphs may appear somewhat bizarre: the
antecedent/consequent edge labels are unusual, as is acceptance based on all paths
accepting. However, assertion graphs are actually the natural combination of symbolic
trajectory evaluation and automata-theoretic specification. The antecedent/consequent
style comes from classical symbolic trajectory evaluation [16] and is a natural way to
specify temporal properties. For example, timing diagrams, one of the most widely used
hardware specifications in practice, are typically interpreted this way (e.g., if some
sequence of events happens, then some other events must happen) [2]. In addition, the
explicit identification of antecedents and consequents provides an efficiency benefit,
because the model-checking algorithm can limit its search on-the-fly to paths that satisfy
the antecedents. The “for all paths” acceptance criterion makes assertion graphs a variety
of $\forall$-automata [9], which are less familiar than the usual existential acceptance of
non-deterministic automata (where a trace is accepted if there exists a corresponding path
through the automaton), but the $\forall$ semantics also provides both usability and efficiency
benefits. The usability arises because an assertion graph defines a set of assertions, and
one typically wants all assertions to be true; in contrast, usually with automata as
specifications, the automaton directly defines a set of possible behaviors, so verification consists
of determining if the system’s behavior exists in the set provided by the specification.
The efficiency advantage of the $\forall$ semantics (as in other works that use $\forall$-automata
as specifications [9,8,2]) is that a $\forall$-automaton is essentially pre-complemented, so
checking language containment can bypass the expensive step of complementing a
non-deterministic automaton. Indeed, GSTE model-checking is very efficient in practice, and
the correctness of the algorithm relies on the $\forall$ semantics.

We emphasize that assertion graphs take their present form as the direct result of
practical considerations. The natural theoretical question is what relationship they have
to more established formalisms. Assertion graphs with fairness can express all $\omega$-regular
properties: an easy construction is to start with a non-deterministic, generalized Büchi
automaton and then to note that the almost-isomorphic assertion graph (with the same
structure, the same fairness constraints, the Büchi automaton’s edge labels moved to the
antecedents, and all consequents labeled with False) accepts the complement language;
$\omega$-regular expressiveness follows because $\omega$-regular languages are closed under
complementation. The same construction also shows that non-deterministic Büchi automata
can be simulated with a single-exponential blow-up (to pre-complement the Büchi
automaton), and that LTL model checking can be translated to GSTE with at worst the
same complexity as the translation to generalized Büchi automata, for which efficient
tools exist (e.g., [17]). In the other direction, assertion graphs can be simulated by more
conventional automata.* Analogous results hold for assertion graphs with terminal
satisfiability and ordinary regular automata. In theory, therefore, assertion graphs are no
more expressive.

In our case, we have an existing user community with practical experience using 
GSTE assertion graphs as well as an industrially-proven, efficient GSTE model-checking 
tool. The short-term need was for algorithms for rudimentary reasoning with assertion 
graphs — implication and model-checking under assumptions — so we sought to de- 
velop efficient algorithms to perform these operations directly on assertion graphs (with 
terminal satisfiability), exploiting the existing GSTE model-checking engine as much 
as possible. 

2.2 Monitor Circuits from Assertion Graphs 

Our algorithms for reasoning about assertion graphs rely on an efficient (linear space 
and time) algorithm for constructing circuits from assertion graphs, which was inspired 
by efficient methods for generating circuits from regular expressions [15,14,11]. The 
construction is rather intricate and is described elsewhere [7]. Here, we give a brief 
overview. 

Given an assertion graph G, we construct a monitor circuit for G. A monitor circuit
is simply a small circuit that watches, without interfering, the system being verified and
flags whether or not the system is obeying some user-specified correctness property. In
this case, the monitor circuit has inputs corresponding to the atomic propositions AP
that are used in G. The monitor circuit has a single output accept, which is true iff
the trace that has been observed on the inputs would be accepted by G. The circuit is a
Mealy machine, so the value at the inputs is immediately reflected at the accept output.
The circuit also has an init input, which initializes the internal state of the circuit; init
is asserted at the same time that the first state of the execution trace is presented at the
inputs, and then de-asserted from then on. See Figure 2.

* Simulation by a conventionally labeled $\forall$-automaton can be done with twice as many states;
simulation by a normal $\exists$-automaton requires an exponential blow-up. We would like to thank
the anonymous reviewers for suggesting the construction for simulation via conventional
$\forall$-automata, and for pointing out that there cannot be a general sub-exponential construction to
simulate assertion graphs via normal $\exists$-automata or vice-versa.

Intuitively, the monitor circuit has an internal copy of the assertion graph and keeps 
track of paths by placing tokens on the edges in its copy. In theory, each token represents 
a path that ends on that edge at that clock cycle, and the token remembers the history 
of which antecedents and consequents were true during preceding clock cycles. At each 
clock cycle, tokens can update their histories and advance to the next edge, possibly 
splitting into multiple tokens if there are multiple out-going edges. The monitor accepts 
a trace iff all tokens represent accepting paths. The key insight to making this construction 
efficient is that the tokens can actually be almost memoryless. The only history necessary 
is to distinguish between three different kinds of pasts: (1) if an antecedent has failed
already, this path and its continuations will always accept, so they need not be tracked
any further, (2) if all antecedents and all consequents so far have succeeded, then this
path currently accepts, but its continuations might not, and (3) if all antecedents have
succeeded, but at least one consequent has failed, then this path currently rejects, but its
continuations might eventually accept if an antecedent fails in the future. All paths with
the same history that arrive at the same edge at the same time will share the same future, 
so their tokens can be merged. Hence, the constructed monitor circuit has a structure that 
exactly corresponds to the assertion graph, with two state bits per edge to track the two 
kinds of tokens, and a constant amount of circuitry per edge and per vertex to update the 
tokens appropriately. The constructed circuit is clearly linear-size compared to G. 
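The token scheme can be simulated in software to see why two bits per edge suffice. The following is a sketch under stated assumptions, not the circuit construction of [7]: it works on one concrete trace, uses a hypothetical edge encoding `(src, dst, ant, cons, is_terminal)` with Python predicates, and assumes terminal satisfiability.

```python
def monitor(edges, v0, trace):
    """Return the value of `accept` after each state of `trace`.

    Two bits are kept per edge:
      ok[e]  - some path ending on e has had all antecedents and all
               consequents succeed so far;
      bad[e] - some path ending on e has had all antecedents succeed
               but at least one consequent fail.
    Paths with a failed antecedent carry no token at all.
    """
    n = len(edges)
    ok = [False] * n
    bad = [False] * n
    outputs = []
    for step, state in enumerate(trace):
        new_ok = [False] * n
        new_bad = [False] * n
        for j, (src, dst, ant, cons, _) in enumerate(edges):
            if step == 0:
                # init is asserted: one fresh token on every edge leaving v0
                in_ok, in_bad = (src == v0), False
            else:
                preds = [i for i in range(n) if edges[i][1] == src]
                in_ok = any(ok[i] for i in preds)
                in_bad = any(bad[i] for i in preds)
            if not ant(state):
                continue  # antecedent failed: these paths accept from now on
            holds = cons(state)
            new_ok[j] = in_ok and holds
            new_bad[j] = in_bad or (in_ok and not holds)
        ok, bad = new_ok, new_bad
        # accept iff no terminal edge currently carries a rejecting token
        outputs.append(not any(bad[j] for j, e in enumerate(edges) if e[4]))
    return outputs
```

The work per step is proportional to the number of edges and their predecessors, independent of the number of paths, which is the point of merging tokens with the same kind of past.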

3 Assertion Graph Implication 

We now consider determining whether one assertion graph $G_1$ implies another assertion
graph $G_2$, or, equivalently, whether $L(G_1) \subseteq L(G_2)$.

3.1 Implication via Product Construction 

The monitor circuit construction immediately yields an obvious way to determine
whether $L(G_1) \subseteq L(G_2)$:

1. Build circuits $C_1$ and $C_2$ for the assertion graphs $G_1$ and $G_2$.

2. Tie the inputs together.

3. Verify on the combined machine, using GSTE or any other model checking method,
whether $accept_1 \Rightarrow accept_2$ in all reachable states.

The disadvantage of this approach is that we are building circuits for both $G_1$ and $G_2$,
rather than using $G_2$ as a specification, potentially increasing the possibility of state
explosion. Instead, we would like to harness the efficiency of GSTE and avoid adding
$G_2$ to the state space.




176 



A.J. Hu, J. Casas, and J. Yang 



3.2 Implication via GSTE 

Given a circuit $M$ and an assertion graph $G$, GSTE model checking provides an efficient
way to determine whether $L(M) \subseteq L(G)$, or equivalently, whether $M$ is a model of
$G$, notated $M \models_T G$. (The “T” is for terminal satisfiability.) With our construction
of a circuit from an assertion graph, one might consider generating a circuit $C_1$ from
assertion graph $G_1$, and then determining whether $G_1 \Rightarrow G_2$ by model checking whether
$C_1 \models_T G_2$. Unfortunately, this approach does not work: $C_1$ is a monitor circuit that
indicates whether or not an input stream was an accepting trace; it is not a circuit whose
behaviors are exactly the accepting traces. A more subtle approach is needed.

The idea behind our algorithm is to modify G2 so that it ignores traces that are 
not accepted by Gi. More precisely, given assertion graphs Gi and G2, we determine 
whether Gi G2 as follows: 



1. Without loss of generality, we assume that the initial vertex $v_0$ of $G_2$ has in-degree
0. (If this is not the case, we can modify $G_2$ by creating a duplicate initial vertex
$v_0'$, which has the same incoming and outgoing edges as $v_0$, and then deleting the
incoming edges to the true initial vertex $v_0$.)

2. Apply the monitor circuit construction to $G_1$, resulting in circuit $C_1$.

3. Modify $G_2$ to work with $C_1$, creating a new assertion graph $G_2'$:

a) The new graph $G_2'$ has all of the same vertices as $G_2$.

b) For every edge $e$ in $G_2$ from vertex $v_i$ to vertex $v_j$, create two edges $e'$ and $e''$,
both from vertex $v_i$ to vertex $v_j$. Set

$$\mathit{ant}(e') = \begin{cases} \mathit{ant}(e) \wedge \mathit{accept} \wedge \mathit{init} & \text{if } v_i = v_0 \\ \mathit{ant}(e) \wedge \mathit{accept} \wedge \neg\mathit{init} & \text{otherwise} \end{cases}$$

$$\mathit{ant}(e'') = \begin{cases} \mathit{ant}(e) \wedge \neg\mathit{accept} \wedge \mathit{init} & \text{if } v_i = v_0 \\ \mathit{ant}(e) \wedge \neg\mathit{accept} \wedge \neg\mathit{init} & \text{otherwise} \end{cases}$$

The consequents do not change: $\mathit{cons}(e) = \mathit{cons}(e') = \mathit{cons}(e'')$. Edge $e'$ is a
terminal edge in $G_2'$ iff edge $e$ is a terminal edge in $G_2$. Edge $e''$ is not a terminal
edge.

c) Add init and accept to the atomic proposition set.

Figure 3 shows this construction applied to the assertion graph from Figure 1.

4. Use GSTE to model check whether $C_1 \models_T G_2'$. The result is true iff $G_1 \Rightarrow G_2$.
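
The edge-doubling step of this construction can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Edge` record, the function `double_edges`, and the representation of antecedents and consequents as formula strings are hypothetical, and the sketch assumes the initial vertex already has in-degree 0 (step 1).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    src: str        # source vertex name
    dst: str        # destination vertex name
    ant: str        # antecedent, kept as an uninterpreted formula string
    cons: str       # consequent, likewise a formula string
    terminal: bool  # True if this is a terminal edge


def double_edges(edges, initial="v0"):
    """Return the edges of the modified graph: each edge e yields e' and e''.

    e'  conjoins 'accept' to the antecedent and keeps e's terminal flag;
    e'' conjoins '!accept' and is never terminal. Edges leaving the initial
    vertex require 'init'; all other edges require '!init'.
    """
    doubled = []
    for e in edges:
        init_lit = "init" if e.src == initial else "!init"
        doubled.append(
            Edge(e.src, e.dst, f"({e.ant}) & accept & {init_lit}", e.cons, e.terminal)
        )
        doubled.append(
            Edge(e.src, e.dst, f"({e.ant}) & !accept & {init_lit}", e.cons, False)
        )
    return doubled
```

The output has exactly twice as many edges as the input and the same vertices, so the modified graph stays linear in the size of the original.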

Proof that $G_1 \Rightarrow G_2$ implies $C_1 \models_T G_2'$:

Suppose $C_1 \not\models_T G_2'$. Then there exists a trace $\sigma'$ of $C_1$ and a terminal path $\rho'$ of $G_2'$,
of the same length, where $\sigma'$ satisfies all the antecedents in $\rho'$ but fails at least one
consequent. Define the trace $\sigma$ by projecting out the accept and init signals from
each state of $\sigma'$. Define the path $\rho$ in $G_2$ formed from $\rho'$ by mapping back through the edge
doubling. We prove that $\sigma$ is a witness that $G_1 \not\Rightarrow G_2$ by showing that:

1. $\sigma \models_T G_1$.

2. $\rho$ is a terminal path in $G_2$.

3. $\sigma$ satisfies the antecedents along $\rho$.

4. $\sigma$ fails at least one consequent along $\rho$.

Fig. 3. Assertion Graph Modified to Consider Only Accepting Paths. This figure shows the
result of modifying the assertion graph in Figure 1 using the construction from Section 3.2.
Edge labels are defined as in Figure 1, with AI := $accept \wedge init$, AN := $accept \wedge \neg init$,
RI := $\neg accept \wedge init$, and RN := $\neg accept \wedge \neg init$. The implication construction modifies
an assertion graph so that it considers only the accepting paths of the other assertion graph. The
basic idea is to double all edges, with one edge guessing that the path is accepting and the other
edge guessing that the path is rejecting. Because these guesses are in the antecedents, paths that
guess wrong are disregarded. The modification also ensures that the monitor circuit is initialized
properly, via the init signal.

Claim 1: We know that $\sigma'$ satisfies the antecedents of $\rho'$. Therefore, the circuit $C_1$ is
initialized properly, because the antecedents constrain the init signal. Also, the
accept signal is true in the last state of $\sigma'$, because $\rho'$ ends on a terminal edge, so
$\sigma$ is an input sequence that would end up with $C_1$ accepting. Therefore, $\sigma \models_T G_1$,
by the construction of $C_1$.

Claim 2: $G_2'$ is created by doubling the edges of $G_2$. Undoing the doubling maps the
path back to a path on $G_2$. Since $\rho'$ ended on a terminal edge in $G_2'$, the corresponding
edge in $G_2$ must also be a terminal edge, so $\rho$ is a terminal path.

Claim 3: Recall that $\sigma'$ satisfies the antecedents of $\rho'$. The path $\rho$ has antecedents that are
strictly weaker than the corresponding antecedents in $\rho'$, because they are missing
the conjuncts about accept and init. Therefore, $\sigma$ satisfies the antecedents of $\rho$.

Claim 4: We are given that $\sigma'$ fails at least one consequent along $\rho'$. The consequents
are the same in $\rho$ and $\rho'$, so $\sigma$ must fail the corresponding consequent along $\rho$. ∎

Proof that $G_1 \Rightarrow G_2$ is implied by $C_1 \models_T G_2'$:

Suppose $G_1 \not\Rightarrow G_2$. Then there exists a trace $\sigma$ (in the state space defined by the atomic
propositions) such that $\sigma \models_T G_1$, but $\sigma \not\models_T G_2$. We will construct a trace $\sigma'$ of $C_1$ that
is not accepted by $G_2'$, witnessing that $C_1 \not\models_T G_2'$.

We construct $\sigma'$ by augmenting the state space of $\sigma$ with values for init and accept.
For the initial state of $\sigma'$, set init to be 1. In all other states of $\sigma'$, set init to be 0.
Because $C_1$ is a Mealy machine, we can always compute the value of accept by feeding
$\sigma$ as input to $C_1$. Thus, $\sigma'$ is a trace of $C_1$ by construction. The resulting trace $\sigma'$ has
atomic proposition values that are the same as $\sigma$ and has accept true in the last state
(because $\sigma$ is accepted by $G_1$).

Since $\sigma \not\models_T G_2$, we know there exists a terminal path $\rho$ in $G_2$, of the same length as
$\sigma$, such that $\sigma$ satisfies all the antecedents in $\rho$ but fails at least one consequent. Construct
path $\rho'$ in $G_2'$ as follows: match $\rho$ edge-for-edge, picking the accept or $\neg$accept version
of the edge in $G_2'$ depending on the value of the accept signal in $\sigma'$. Since $\sigma'$ ends with
accept true, the constructed path $\rho'$ ends at a terminal edge in $G_2'$.

Now, we see that $\sigma'$ satisfies the antecedents in $\rho'$ because the states/antecedents are
the same as in $\sigma$ and $\rho$ (with the accept or $\neg$accept edge chosen correctly by the
construction of $\rho'$). On the other hand, $\sigma$ fails at least one consequent of $\rho$, so $\sigma'$ must fail
the corresponding consequent of $\rho'$, since the consequents are the same in both paths.
Therefore, $\sigma'$ witnesses that $C_1 \not\models_T G_2'$. ∎



4 Model Checking under an Assumption 



Besides assertion graph implication, the other main reasoning tool we wanted was a way
to perform GSTE model checking under an assumption. We notate this problem
$C_0 \models_T (G_1 \Rightarrow G_2)$, meaning that all behaviors of a circuit $C_0$ that satisfy an assertion graph
$G_1$ (the assumptions) also satisfy the assertion graph $G_2$. This construction is closely
related to the preceding one.

The basic idea is that we build a monitor circuit $C_1$ for $G_1$ and augment $C_0$ with this
monitor, in a non-interfering manner. Then, we modify $G_2$ so that it ignores traces that are
not accepted by the monitor, resulting in verifying only the behaviors of $C_0$ that satisfy
the assumptions of $G_1$. An alternative intuition is to consider the implication construction
in Section 3.2 as the special case of model checking a completely unconstrained machine
under the assumption of $G_1$; here, we constrain the inputs of $C_1$ to be the behaviors of
$C_0$.

1. Without loss of generality, we assume that the initial vertex $v_0$ of $G_2$ has in-degree
0.

2. Build the monitor circuit $C_1$ from $G_1$.

3. Connect the inputs of $C_1$ to the state variables of $C_0$. In this way, $C_1$ will watch
$C_0$ and indicate accept/reject depending on whether or not $C_0$’s behavior obeys the
assertion graph $G_1$. Call this combined circuit $C_{01}$.

4. Build $G_2'$ from $G_2$ by edge-doubling and modifying the antecedents, exactly as in
the implication construction.

5. $C_{01} \models_T G_2'$ iff $C_0 \models_T (G_1 \Rightarrow G_2)$.

Proof: 

The constraints on init in the antecedents of $G_2'$ guarantee that we only consider traces
in which $C_1$ is properly initialized.

The monitor circuit $C_1$ has no effect on $C_0$. Therefore, $C_{01}$ has the same traces as
$C_0$, except for some additional state bits that determine whether or not $G_1$ would have
accepted the trace.

Any path in $G_2'$ that guesses accept/reject incorrectly on any edge will have its
antecedent fail and will be ignored. For any path in $G_2$, there will always exist a
corresponding path in $G_2'$ that guesses accept/reject correctly for every edge. The only paths
that are checked are the ones that are terminal in $G_2'$, which means that they were
terminal in $G_2$ as well, and also that the accept signal is true, which means that $G_1$ would
have accepted the path. Thus, we check only the traces of $C_0$ that satisfy $G_1$. ∎
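
The non-interference argument can be made concrete with a small Python sketch of the composition in step 3. Everything here is an assumption for illustration: circuits are modelled as `step(state, inp) -> next_state` functions, monitors as `step(monitor_state, observed) -> (next_monitor_state, accept)` functions, and the names `compose_with_monitor`, `run_combined`, `counter` and `saw_three` are hypothetical. The sketch also assumes the monitor observes the circuit's current state variables.

```python
def compose_with_monitor(circuit_step, monitor_step):
    """Build a combined step function in which the monitor only observes."""
    def combined_step(state, inp):
        c_state, m_state = state
        # The circuit advances exactly as it would on its own.
        next_c = circuit_step(c_state, inp)
        # The monitor reads the circuit's state variables; it never writes them.
        next_m, accept = monitor_step(m_state, c_state)
        return (next_c, next_m), accept
    return combined_step


def run_combined(combined_step, state, inputs):
    """Return the list of (circuit_state, accept) pairs along one run."""
    history = []
    for inp in inputs:
        state, accept = combined_step(state, inp)
        history.append((state[0], accept))
    return history


# Toy circuit: a 2-bit counter that adds its input each cycle.
def counter(state, inp):
    return (state + inp) % 4


# Toy monitor: accepts once it has observed the counter holding the value 3.
def saw_three(seen, observed):
    seen = seen or observed == 3
    return seen, seen
```

Projecting the circuit component out of a combined run gives the same state sequence as running the circuit alone, which is the property the proof relies on.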

Fig. 4. Decomposing a Property. We have manually decomposed the assertion graph from
Figure 1 into two smaller ones. The edge labels are as before, except
READ := $(ck = 0) \wedge (we = 0) \wedge (addr = A)$ and READ_RESULT := $(memout = D)$, where memout is the internal data
output of the memory array. We model check that the memory unit obeys the smaller assertion
graphs, and then use our implication construction to verify that the two smaller assertion graphs
imply the original specification. This process took less than 2/3 the time of verifying the original
property directly.

5 Experimental Results 

We have implemented the above algorithms in Intel’s Forte verification system¹ and
report their effectiveness on two verification tasks taken from real, industrial problems. 

5.1 Decomposing a Verification Property: Verifying a Memory Unit 

The first example is the verification of an industrial memory unit, using the assertion 
graph from Figure 1. Verifying this assertion graph on the memory unit by directly 
applying GSTE model checking required 56 seconds. 

Alternatively, we manually decomposed the assertion graph into two smaller assertion
graphs $G_1$ and $G_2$, which separate the memory behavior from the selection and
alignment specifications. See Figure 4. GSTE model checking these two specifications
on the memory unit took 28 seconds and 7 seconds, respectively. Note that, because of
the $\forall$ semantics, we can produce the assertion graph for $G_1 \wedge G_2$ simply by having the
two graphs share a single initial vertex. Accordingly, we verified that $(G_1 \wedge G_2)$ implies
the original assertion graph, using the implication construction from Section 3.2. This
step took 0.3 seconds, and the generated monitor circuit for $(G_1 \wedge G_2)$ had 5338 gates
and 44 latches, far smaller than the memory unit. The total verification runtime was,
therefore, less than 36 seconds, compared to the original 56 seconds.

Obviously, for such a small property, the time savings are not enough to repay 
the effort of decomposing the property. Nevertheless, we see that the decomposition 
does reduce the overall model-checking complexity, and our new algorithm does enable 
verifying automatically that a combination of sub-properties implies a more complex 
one. For larger, more challenging verification tasks, being able to decompose a difficult
property into smaller ones, verify the smaller properties, and then conclude that the
original property holds, is extremely useful.

¹ Forte is available for download at
http://www.intel.com/software/products/opensource/toolsl/verification/
but our new algorithms are not yet part of the standard distribution.

Fig. 5. Content-Addressable Memory (CAM). A CAM allows finding data by matching the
value of a tag. In this CAM, a 64-bit data value is written at the same time as an 8-bit tag. Values
can be read by supplying the correct tag. The match[i] signals indicate which of the 16 tags
matches a supplied tag. The “outputs” on the right are for verification only: $hit = \bigvee_i match[i]$,
and matchout[i] = datamem[i] if match[i] is true, otherwise matchout[i] = 0. The overall
CAM has 1152 latches. Our verification will cut the circuit at the dotted line. We first verify the
tag portion of the circuit, then use that assertion graph as an assumption to verify the data portion
of the circuit.



5.2 GSTE with an Assumption: Content-Addressable Memory 

The second example is from the verification of a content-addressable memory (CAM). 
This example illustrates GSTE model checking under an assumption. 

A CAM allows finding data in its memory by matching a given tag value in an array 
of stored tags, i.e., by matching a value to the content of storage locations, rather than 
by address. CAMs are ubiquitous in modern microprocessors, where they are used to 
cache small amounts of frequently accessed data (e.g., in caches, TLBs, and assorted 
other buffers). Figure 5 shows the CAM for this example. 

We wish to verify that the CAM as a whole satisfies the assertion graph $G_2$ in Figure 6.
Verifying this assertion graph on the CAM by directly applying GSTE model checking
required 15 seconds. Alternatively, to evaluate our algorithm for model checking under
an assumption, we first verified the correct operation of the tag portion, in isolation,
against the tag-correctness assertion graph $G_1$ in Figure 7. This verification took 0.8
seconds. Then, we abstracted away the tag portion of the CAM and used our algorithm
for verification under an assumption to verify that $G_2$ holds, assuming that $G_1$ does:
(data portion of CAM) $\models_T (G_1 \Rightarrow G_2)$. This verification took 7 seconds. Altogether,
the decomposed verification was roughly twice as fast as the direct approach, and the
monitor circuit for $G_1$ had only 12 latches, an order of magnitude less than the tag
memory that was abstracted away.
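
The runtime comparisons in Sections 5.1 and 5.2 reduce to simple arithmetic, which the following Python sketch spells out. The timings are the ones reported in the text; the dictionary layout and the helper name `summarize` are merely illustrative choices.

```python
# Reported runtimes in seconds: direct GSTE run versus the decomposed steps.
memory = {"direct": 56.0, "parts": [28.0, 7.0, 0.3]}  # G1, G2, implication check
cam = {"direct": 15.0, "parts": [0.8, 7.0]}           # tag portion, data portion


def summarize(run):
    """Return (total decomposed time, speedup over the direct run)."""
    total = sum(run["parts"])
    return total, run["direct"] / total


memory_total, memory_speedup = summarize(memory)
cam_total, cam_speedup = summarize(cam)
```

The memory-unit total is about 35.3 seconds, under two thirds of the 56-second direct run, and the CAM total is about 7.8 seconds, a little under a twofold speedup over 15 seconds.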

Edge labels (antecedent / consequent) in the graph: TAG_WRITE&DATA_WRITE / True;
TAG_RETAIN&DATA_RETAIN / True; TAG_READ / TAG_RESULT&DATA_RESULT; True / True.

TAG_WRITE := $(twrite = 1) \wedge (taddr = A) \wedge (tagin = T)$
DATA_WRITE := $(dwrite = 1) \wedge (daddr = A) \wedge (din = D)$
TAG_RETAIN := $(twrite = 0) \vee (taddr \neq A)$
DATA_RETAIN := $(dwrite = 0) \vee (daddr \neq A)$
TAG_READ := $(aread = 1) \wedge (tagin = T)$
TAG_RESULT := $(hit = 1) \wedge \forall i\,[(i = A) \Rightarrow (match[i] = 1)]$
DATA_RESULT := $\forall i\,[(i = A) \Rightarrow (matchout[i] = D)]$

Fig. 6. CAM Correctness Specification. This assertion graph specifies that if a tag and data
values are written, followed by an arbitrary number of cycles in which they are not overwritten,
followed by a read by the same tag, then the CAM must indicate a hit, and the matchout signal
must give the correct data value at any matching locations.



Edge labels (antecedent / consequent) in the graph: TAG_WRITE / True; TAG_RETAIN / True;
TAG_READ / TAG_RESULT; True / True.

Fig. 7. Tag Correctness Specification. This assertion graph specifies that if a tag is written, not
overwritten for an arbitrary number of cycles, and then the same tag is presented, the hit signal
and the correct match signal must be asserted. We first verify this property on the tag portion of
the circuit. Then, we use this assertion graph as an assumption to abstract away the tag portion of
the circuit when verifying the whole CAM.



As in the previous example, the time savings on a small verification task are not 
enough to repay the time to manually decompose the problem. Nevertheless, this example 
does demonstrate how our new algorithm runs efficiently and enables decomposing a 
harder verification problem into smaller, easier ones. In general, we envision using this 
style of proof for simplifying complex verification tasks, and also for verification with IP 
cores (portions of a circuit supplied by third parties, for which functionality is specified,
but internal details are not visible) as well as the verification of partial or incomplete 
circuits. 



6 Conclusion and Future Work 



We have presented new algorithms for reasoning about GSTE assertion graphs. These
algorithms appear efficient in theory, and preliminary experiments indicate that they are
efficient in practice as well. Given the increasing practical importance of GSTE model
checking, the need for (practically efficient) supporting theory and algorithms is great.
This work is a first step.

The practical success of GSTE is the justification for studying assertion graphs. In 
theory, assertion graphs are simply a new variety of automata, with equivalent expres- 
sive power to established varieties of automata, so an obvious, fundamental question is 
to elucidate whether and how GSTE is gaining efficiency advantages over older tech- 
niques. Do assertion graphs facilitate writing specifications in a manner that enables 
more efficient model checking? Are other aspects of GSTE, completely separable from 
assertion graphs, more important for efficiency? Can we leverage these ideas with other
verification methods? On the other hand, perhaps the practical successes have been pri- 
marily the result of the overall verification methodology, the types of verification tasks 
undertaken, or the skill of the verification engineers. Assertion graphs and GSTE give 
symbolic-trajectory-evaluation-based approaches comparable expressive power to other 
model-checking approaches, so it is now possible to make direct comparisons. 

Focusing on assertion graphs, research is needed on composing and decomposing
assertion graphs. For example, given the $\forall$ semantics, it should be possible to decompose
a large assertion graph into the conjunction of smaller ones, as is possible in
formalizations of timing graphs [2]. Such a decomposition could reduce the complexity of model
checking.

A related, and perhaps more immediately applicable, direction for research is to 
look for transformations and inference rules for assertion graphs. For example, it is easy 
to see that adding edges, weakening antecedents, or strengthening consequents are all 
operations that cannot enlarge the set of traces accepted by an assertion graph. Perhaps it 
is possible to develop a powerful set of inference rules to reason about assertion graphs, 
without having to perform model checking. 

The algorithms presented here are fundamental building blocks for reasoning about assertion
graphs. An important next step is to develop compositional verification theorems,
so that we can automate the process of stitching together partial verification results. 

Finally, although assertion graphs are interesting to consider in isolation as a variety 
of automata, in practice their use is intimately tied to GSTE model checking. This
connection suggests that it may be interesting to consider weaker notions of implication (and
equivalence). For example, rather than defining $G_1 \Rightarrow G_2$ to mean $L(G_1) \subseteq L(G_2)$, we
could use the weaker definition: $\forall$ circuits $M.\ (M \models G_1) \Rightarrow (M \models G_2)$. Under all the
different acceptance conditions, we have constructed small assertion graphs $G_1$ and $G_2$
such that $L(G_1) \neq L(G_2)$, but that are equivalent under the weaker definition because
no circuit satisfies either one. (The intuition is that real circuits cannot generate arbitrary
sets of strings, e.g., a circuit can always be run for one more clock cycle, generating
a longer string.) We do not know whether the difference between these definitions is
theoretically interesting or practically important.

In general, increasing evidence demonstrates the practical value of GSTE and asser- 
tion graphs, but the supporting infrastructure is underdeveloped. Much work remains to 
be done. 
Abstract. Many industrial verification teams are developing suitable
event sequence languages for hardware verification. Such languages must
be expressive, designer friendly, and hardware specific, as well as efficient
to verify. While the formal verification community has formal models for
assessing the efficiency of an event sequence language, none of these models
also accounts for designer friendliness. We propose an intermediate
language for event sequences that addresses both concerns. The language
achieves usability through a correlation to timing diagrams; its efficiency
arises from its mapping into deterministic weak automata. We present
the language, relate it to existing event sequence languages, and prove
its relationship to deterministic weak automata. These results indicate
that timing diagrams can become more expressive while remaining more
efficient for symbolic model checking than LTL.



1 Introduction 

The increasing adoption of formal verification has led to a flurry of research 
into property specification languages for hardware verification. Large-scale ef- 
forts include Accellera’s standardization of Sugar [1], Synopsys’ OVA [13], and 
Intel’s FTL [4]. Generally speaking, these are event sequence languages: they 
allow designers to express sequences of events to monitor and check during ver- 
ification. The proliferation of work from industry on event sequence languages 
emphasizes that they must be designer friendly, expressive, and specific to the 
hardware domain in addition to efficient to verify. Although practical experience 
and theoretical results give insights into how to achieve these goals individually, 
few formal models attempt to address usability and efficiency simultaneously. 

In the space of event sequence languages, timing diagrams provide an ap- 
pealing combination of usability and efficiency. Designers have established their 
utility by regularly employing them as an informal design tool. Mappings from
formalized timing diagrams to deterministic weak automata [8] provide effectively
linear symbolic verification algorithms [5]. That timing diagrams are not
more widely used as event sequence languages suggests that they lack the
expressiveness needed in industrial verification [3]. Their combination of utility and
efficiency, however, raises an interesting question: how expressive can we make
an event sequence language while retaining both diagrammability and efficiency?

(Timing diagram omitted.)

LTL: $\neg a \wedge X(a \wedge ((\neg b \wedge X(b \wedge F(\neg c \wedge Xc))) \vee X(\neg b \wedge X(b \wedge F(\neg c \wedge Xc))) \vee XX(\neg b \wedge X(b \wedge F(\neg c \wedge Xc)))))$

Sugar: ¬a & next!(a & next_e![1,3](¬b & next!(b & eventually!(¬c & next! c))))

Fig. 1. Expressing an event sequence in three languages.

This paper explores this question by proposing a (textual) intermediate lan- 
guage for capturing event sequence languages. To target diagrammability, we de- 
sign the core of the language around timing diagrams. To target expressiveness, 
we extend the core language to capture constructs from other event sequence lan- 
guages. To target efficiency, we syntactically characterize which expressions in 
this language map to deterministic weak automata. The results of this work are 
twofold: first, our language provides a framework in which to assess both usabil- 
ity and efficiency of other event sequence languages; second, our characterization 
proves that timing diagrams can be extended with several new features — such 
as partial orders between events, interleaved environmental assumptions, escap- 
ing conditions, and event clocks — without losing their mapping to deterministic 
weak automata. Our long-term goal is to develop formal models that simulta- 
neously characterize both usability and efficiency in event sequence languages. 
This paper focuses on the efficiency of verifying our proposed language; future 
papers will treat formal models of diagrammability as a measure of usability. 

2 Preliminaries 

2.1 Event Sequences and Timing Diagrams 

Event sequences, as their name implies, capture sequences of events on signals in 
a design; they express properties for verification or simulation. Regular expres- 
sions and linear temporal logic have similar goals, but also some subtle differ- 
ences. Event sequences often monitor transitions on signals in the design, rather 
than just boolean values of propositions. In addition, event sequences generally 
capture timing constraints between events. While both regular expressions and 
linear temporal logic can capture these features, the resulting expressions can 
be rather cumbersome, especially in contrast to event sequences and timing di- 
agrams. Figure 1 shows a simple example of the same event sequence expressed 
as a timing diagram, in linear temporal logic (LTL), and in Sugar. 

Although timing diagrams present event sequences somewhat intuitively, they
are not as expressive as some other event sequence languages. For example,
textual event sequence languages easily express disjunctions, while diagrams in
general capture disjunctive information poorly. The mapping from timing diagrams
to weak automata, which does not hold for full LTL, demonstrates benefits to
this limited expressive power. The question, then, is how far we can push timing
diagrams while retaining this mapping. The timing diagram shown in Figure 2,
for example, expresses some disjunction as the order of events is left unspecified
(a partial order rather than a total one). This extension adds expressive power
without sacrificing diagrammability or weakness. We are interested in similar
extensions based on constructs from modern event sequence languages.

$T = \{(a\uparrow, c\uparrow, 2, 5, true),\ (c\uparrow, a\downarrow, 1, \infty, true),\ (a\downarrow, b\downarrow, 3, 9, true)\}$

Fig. 2. A timing diagram with partial orders and its mapping into an event sequence.



2.2 Weak Automata 

A Büchi automaton $(Q, \Sigma, q_0, R, L, \mathcal{F})$ is weak if it has only one fair set and
each of its strongly connected components has either all states fair or no states
fair [10]. Weak automata are attractive in verification because symbolic cycle
detection is effectively linear for weak automata, as opposed to quadratic for full
LTL [5]. Deterministic weak automata are particularly interesting for their
properties under complementation. Automata-based verification approaches
complement automata that capture properties. In the general case, complementing a
Büchi automaton can blow up the number of states exponentially. Complementing
a deterministic weak automaton, however, requires only complementing the
fair set; the structure of an automaton and its complement are otherwise
identical. This represents a substantial savings in construction time, and more
importantly, in the size of automata used to represent complemented properties.
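
The weakness condition and the cheap complementation step can be sketched in Python as follows. This is a minimal explicit-graph sketch under assumptions: the automaton is given as a list of states, a successor dictionary `succ`, and a single fair set; the function names `sccs`, `is_weak` and `complement_fair_set` are hypothetical, and the strongly connected components are found with a standard two-pass (Kosaraju-style) search rather than any symbolic algorithm.

```python
def sccs(states, succ):
    """Strongly connected components via a two-pass depth-first search."""
    order, seen = [], set()
    for root in states:
        if root in seen:
            continue
        seen.add(root)
        stack = [(root, iter(succ.get(root, ())))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(succ.get(nxt, ()))))
                    break
            else:
                order.append(node)  # all successors explored: record finish
                stack.pop()
    # Second pass walks the reversed graph in decreasing finish order.
    pred = {s: [] for s in states}
    for s in states:
        for t in succ.get(s, ()):
            pred[t].append(s)
    components, assigned = [], set()
    for root in reversed(order):
        if root in assigned:
            continue
        assigned.add(root)
        component, todo = set(), [root]
        while todo:
            node = todo.pop()
            component.add(node)
            for p in pred[node]:
                if p not in assigned:
                    assigned.add(p)
                    todo.append(p)
        components.append(component)
    return components


def is_weak(states, succ, fair):
    """True if every SCC is entirely fair or contains no fair state."""
    fair = set(fair)
    return all(c <= fair or not (c & fair) for c in sccs(states, succ))


def complement_fair_set(states, fair):
    """For a deterministic weak automaton, only the fair set changes."""
    return set(states) - set(fair)
```

Note that `complement_fair_set` leaves the states and transitions untouched, which is the source of the size savings mentioned above; the sketch does not itself check that the automaton is deterministic.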



3 An Intermediate Language for Event Sequences 

This section presents a regular-expression-like notation for event sequences. We 
motivate the development of the language using the example timing diagram 
shown in Figure 2. We explain the semantics of the diagram informally; the 
formal details appear elsewhere [7].

To capture the diagram, the language must express transitions on signals and
constraints (timing and ordering) between these transitions. Let propositional
literals ($p$, $\neg q$) denote boolean values and propositional variables annotated with
arrows ($p\downarrow$, $p\uparrow$) denote falling and rising transitions, respectively. Let semicolons
denote concatenation (temporal sequencing) of events. Using these notations
and reading off the timing diagram from left to right suggests the expression
$(a\uparrow; b\uparrow; c\uparrow; a\downarrow; b\downarrow)$. If we interpret semicolons as implying order between events
(a common interpretation of concatenation), this expression is inconsistent with
the semantics of the timing diagram. The rising transitions on a and b may
occur in any order since no constraint orders them (the falling events on a and
b, in contrast, must occur in order). The event sequence language must therefore
support partial, rather than only total, orders between events.

Timing diagrams consist of totally-ordered regions within which individual 
events are partially ordered. For the sake of generality, our event sequence language
supports hierarchical combinations of ordered, unordered and iterated groups
of events. In the formal syntax and semantics that follow, we refer to these
groups of events as clusters. We capture partial orders within unordered clusters 
using a separate annotation for transition (timing) constraints between events; 
a timing constraint specifies the events covered, lower and upper bounds on the 
time between the events, and the clock against which the bounds are measured 
(true specifies the system clock). This approach treats constraints between events 
uniformly, whether they occur in ordered or unordered clusters. Figure 2 shows 
the resulting event sequence for our example timing diagram. 



3.1 Syntax 

The timing diagram example suggests the following syntax for event sequences: 

Definition 1 Clusters are defined hierarchically as follows:

– An event is a conjunction of values of and transitions on variables that
contains at least one transition. Propositional literals ($p$, $\neg q$) denote boolean
values; propositions with arrows ($p\downarrow$, $p\uparrow$) denote transitions.

– A cluster is either:

• a single event, or

• an unordered cluster $\{C_1, \ldots, C_k\}$ where each $C_i$ is a cluster, or

• an ordered cluster $(C_1; \ldots; C_k)$ where each $C_i$ is a cluster, or

• a repeating cluster $C^M$ where $C$ is a non-repeating cluster and $M$ is a
positive number, $*$, or $+$ (called a repetition marker; markers $*$ and $+$
are called unbounded).

An event sequence consists of a (top level) cluster and three kinds of mod- 
ifiers. Temporal constraints, already motivated, may be relative to a designer- 
specified event clock, as captured by a boolean expression (this is a common 
feature in event sequence languages). To indicate that certain variables hold 
value during regions (between events) in a diagram, holding patterns constrain 
variable values within clusters. To allow portions of diagrams to serve as assump- 
tions rather than requirements, escape conditions capture circumstances under 
which the sequence should be immediately rejected or accepted. 

Definition 2 An event sequence is a tuple $(C, H, T, S)$ where $C$ is a cluster, $H$
(the holding patterns) is a partial function from $C$ to propositional formulas, $T$
is a set of temporal constraints and $S$ is a set of escape conditions.

– A temporal constraint is a tuple $(e_1, e_2, l, u, clk)$ where $e_1$ and $e_2$ are (uniquely
identified¹) events in $C$, $l$ is a positive integer, $u$ is either an integer at least
as large as $l$ or the symbol $\infty$, and $clk$ is a boolean expression (the clock for
the constraint; true indicates the system clock). Events $e_1$ and $e_2$ may lie in
different clusters, but then they must lie in the same repeated clusters.

– An escape condition has one of three types, where $X$ is a boolean expression
over events (the events need not be in $C$) and $C'$ is a cluster within $C$:

• “accept if don’t complete $C'$”

• “reject if see $X$ in $C'$”

• “accept if see $X$ in $C'$”

$C = (a\uparrow^{+}; \{b\uparrow, c\uparrow\}; d\uparrow)$
$H = \{b\uparrow, c\uparrow\} \mapsto a$
$T = \{(c\uparrow, d\uparrow, 2, 5, true)\}$
$S = \{$accept-if-don’t-complete$(a\uparrow^{+})\}$

Fig. 3. A sample event sequence and an example of its semantics.


Figure 3 illustrates an event sequence of some number of rising transitions on 
a, followed by rising transitions on b and c (in either order), followed by a rising 
transition on d. The transition on d must occur between 2 and 5 ticks (inclusive) 
after the transition on c (the timing constraint), signal a must remain true until 
the transition on d occurs (the holding pattern), and the rest of the sequence is
only checked if the transition on a occurs (the escape condition).

The language contains some redundancy for the sake of clarity: ordered clusters,
for example, can be viewed as unordered clusters plus timing constraints. To
simplify the semantics and proofs, we assume that all sequences are in reduced
form, in which all clusters $C^{+}$ are replaced with $(C; C^{*})$, all $C^{M}$ for a concrete
number $M$ are replaced with an ordered cluster of $M$ copies of $C$, and all ordered
clusters $(C_1; \ldots; C_k)$ are replaced with unordered clusters and timing constraints
that require an event from each $C_i$ to occur before an event from each $C_{i+1}$.
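
To make the cluster syntax and the first two reduction rules concrete, here is a small Python sketch. The class names (`Event`, `Unordered`, `Ordered`, `Repeat`), the string encoding of events, and the function `reduce_repeats` are hypothetical choices for illustration. The sketch deliberately omits the third rule (rewriting ordered clusters as unordered clusters plus timing constraints), since that rule also needs the constraint set $T$.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Event:
    label: str                      # e.g. "a^" for a rising transition on a


@dataclass(frozen=True)
class Unordered:
    parts: Tuple["Cluster", ...]    # {C1, ..., Ck}


@dataclass(frozen=True)
class Ordered:
    parts: Tuple["Cluster", ...]    # (C1; ...; Ck)


@dataclass(frozen=True)
class Repeat:
    body: "Cluster"
    marker: Union[int, str]         # a positive int, "*" or "+"


Cluster = Union[Event, Unordered, Ordered, Repeat]


def reduce_repeats(c):
    """Apply the two repetition rules: C+ -> (C; C*) and C^M -> M copies."""
    if isinstance(c, Event):
        return c
    if isinstance(c, (Unordered, Ordered)):
        return type(c)(tuple(reduce_repeats(p) for p in c.parts))
    body = reduce_repeats(c.body)
    if c.marker == "+":
        return Ordered((body, Repeat(body, "*")))
    if c.marker == "*":
        return Repeat(body, "*")
    return Ordered(tuple(body for _ in range(c.marker)))
```

Applied to the top-level cluster of the Figure 3 example, the repeated rising transition on a becomes an ordered pair of one occurrence followed by a starred repetition, while the unordered cluster and the final event are left unchanged.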

3.2 Semantics 

The semantics of event sequences is defined in terms of languages over infinite 
words, where each character in a word is an assignment of boolean values to 
variables. An infinite word models an event sequence if there exists a mapping 
from the clusters in the sequence to ranges of indices into the word (herein called 
windows) such that the windows assigned to each cluster preserve the cluster’s 
constraints; these mappings are called index assignments. 

As an example, consider the event sequence and word shown in Figure 3. 
The word is divided into windows per cluster (demarcated by solid lines), and 
subwindows as necessary for nested clusters (demarcated by dashed lines). We 
first formalize the mappings from clusters to windows. 



^ A numbering scheme could distinguish syntactically similar events. 

Definition 3 Given a word W, a window of W is a subword of W; a pair of 
indices into W, denoted $[i,j]$ where $i \le j$, defines a window. Furthermore, 

- An individual index i defines a trivial window $[i,i]$. 

- Window $[i_1, i_2]$ contains window $[i_3, i_4]$ iff $i_1 \le i_3$ and $i_4 \le i_2$. 

- Window $[i_1, i_2]$ is earlier than window $[i_3, i_4]$ iff $i_1 < i_3$, or $i_1 = i_3$ and $i_2 < i_4$. 

- Given a window $w = [start, end]$, a sequence $[s_1, e_1], \ldots, [s_k, e_k]$ forms a 
non-overlapping covering sequence of windows for w if $s_1 = start$, $e_k = end$, and 
for all $1 \le j < k$, $e_j < s_{j+1}$. 
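
The window relations of Definition 3 translate directly into small predicates. The sketch below assumes windows are Python pairs `(start, end)` of integer indices and reads the containment comparisons as non-strict; the function names are illustrative, not the paper's.

```python
def contains(outer, inner):
    """True if window `outer` contains window `inner`."""
    (i1, i2), (i3, i4) = outer, inner
    return i1 <= i3 and i4 <= i2


def earlier(w1, w2):
    """True if w1 is earlier than w2: earlier start, or same start and earlier end."""
    (i1, i2), (i3, i4) = w1, w2
    return i1 < i3 or (i1 == i3 and i2 < i4)


def is_covering(w, seq):
    """True if `seq` is a non-overlapping covering sequence of windows for `w`."""
    if not seq:
        return False
    start, end = w
    if seq[0][0] != start or seq[-1][1] != end:
        return False
    # Each window must end strictly before the next one starts.
    return all(seq[j][1] < seq[j + 1][0] for j in range(len(seq) - 1))
```

Note that `earlier` coincides with Python's lexicographic comparison of the two pairs.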



Definition 4 A (partial) index assignment for event sequence V and word W 
is a (partial) function from the clusters in (including nested within) V to non- 
empty sets of windows of W. 

A window must meet certain requirements in order to capture the constraints 
of a cluster. The following definitions formalize those requirements. 

Definition 5 Let $E = v_1 \wedge \ldots \wedge v_k$ where each $v_i$ is a proposition, its negation, 
or a rising or falling transition on a proposition. Let W be a word and i an index 
into W. Let $W_i(q)$ denote the value of proposition q at index i into W. Index i 
satisfies E if for every $v_i$: $W_i(p) = 0$ if $v_i = \neg p$; $W_i(p) = 1$ if $v_i = p$; $W_i(p) = 0$ 
and $W_{i+1}(p) = 1$ if $v_i = p\uparrow$; and $W_i(p) = 1$ and $W_{i+1}(p) = 0$ if $v_i = p\downarrow$. 
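
As a minimal sketch of Definition 5, assume a finite prefix of the word is given as a list of dictionaries mapping proposition names to 0 or 1, and each conjunct is a pair `(kind, p)` with `kind` one of `"pos"`, `"neg"`, `"rise"`, `"fall"`. This encoding is hypothetical; since only a finite prefix is available here, a transition at the last index is treated as unsatisfied.

```python
def satisfies(word, i, conjuncts):
    """Check whether index i of `word` satisfies the conjunction of `conjuncts`."""
    for kind, p in conjuncts:
        now = word[i][p]
        if kind == "pos":
            ok = now == 1
        elif kind == "neg":
            ok = now == 0
        elif kind in ("rise", "fall"):
            if i + 1 >= len(word):
                return False  # no successor in the finite prefix
            nxt = word[i + 1][p]
            ok = (now, nxt) == ((0, 1) if kind == "rise" else (1, 0))
        else:
            raise ValueError(f"unknown conjunct kind: {kind!r}")
        if not ok:
            return False
    return True
```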



Definition 6 Given an unordered cluster $C = \{C_1, \ldots, C_k\}$, a schedule of C is 
a sequence $CO_1, \ldots, CO_j$ of non-empty subsets of C such that 

- $CO_1, \ldots, CO_j$ partition C, 

- In every $CO_i$ that contains multiple elements of C, all elements of $CO_i$ are 
single events (rather than other complex clusters), and 

- For each timing constraint $(e_1, e_2, l, u, clk)$ such that $e_1 \in CO_m$ and $e_2 \in CO_n$, 
$m < n$. 
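
The three conditions of Definition 6 can be checked mechanically. The following sketch makes simplifying assumptions that are not in the paper: cluster elements are identified by strings, `events` is the set of elements that are single events, and each timing constraint is reduced to an ordered pair `(e1, e2)` of direct elements of the cluster.

```python
def is_schedule(elements, blocks, events, orderings):
    """Check that `blocks` (a list of sets) is a schedule of `elements`."""
    # Every block must be non-empty.
    if any(len(b) == 0 for b in blocks):
        return False
    # The blocks must partition the cluster's elements.
    flat = [x for b in blocks for x in b]
    if len(flat) != len(set(flat)) or set(flat) != set(elements):
        return False
    # Blocks with several elements may only contain single events.
    if any(len(b) > 1 and not b <= events for b in blocks):
        return False
    # Timing constraints force their first event into a strictly earlier block.
    pos = {x: k for k, b in enumerate(blocks) for x in b}
    return all(pos[e1] < pos[e2]
               for e1, e2 in orderings
               if e1 in pos and e2 in pos)
```

With elements `{"b", "c"}` and no constraints, both `[{"b"}, {"c"}]` and `[{"b", "c"}]` are accepted; adding the ordering `("c", "b")` leaves only `[{"c"}, {"b"}]`.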



Definition 7 Let V be an event sequence, W a word, and I a partial index 
assignment for V and W. I is structurally valid iff for every cluster C in V: 

- If C is an event, then for every $[i,i] \in I(C)$, i satisfies C (Defn 5). 

- If C is a repeating cluster ${C'}^{*}$, then for every $wp$ in $I({C'}^{*})$ there exists 
a natural number m and some sequence $wp_1, \ldots, wp_m$ of non-overlapping 
covering windows for $wp$ such that each $wp_i \in I(C')$. 

- If C is an unordered cluster $\{C_1, \ldots, C_k\}$, then for every window $w \in I(C)$ 
there exists a schedule $CO_1, \ldots, CO_j$ for C and a sequence $w_1, \ldots, w_j$ of 
non-overlapping covering windows for w such that for all $1 \le h \le j$ and all 
$e \in CO_h$, $w_h \in I(e)$. 

Definition 8 Let V = (C, T, H, S) be an event sequence, let W be a word, and 
let I be an index assignment for V and W. I is constraint valid for V and W iff 

1. I satisfies the holding patterns, in that for all clusters $C'$, every $x \in H(C')$ 
and every window $[w_1, w_2] \in I(C')$, every index $w_1 \le i \le w_2$ satisfies x, and 

2. I satisfies the timing constraints, in that for every $(e_1, e_2, l, u, clk) \in T$ and 
every $t_1 \in I(e_1)$ and $t_2 \in I(e_2)$ such that $t_1$ and $t_2$ fall in a common window 
for the smallest cluster containing both $e_1$ and $e_2$, the number of indices 
satisfying $clk$ between $t_1$ and $t_2$ (inclusive) is within the range $[l, u]$. 
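
The timing check in item 2 amounts to counting clock ticks. The sketch below assumes the clock expression has already been evaluated to a list of booleans (one per index of a finite prefix), that `t1 <= t2`, and that both endpoints are counted, which is a literal reading of "inclusive"; `math.inf` stands in for the symbol $\infty$.

```python
import math


def within_bounds(clock, t1, t2, lower, upper):
    """Count indices in [t1, t2] where the clock holds and compare to [lower, upper]."""
    ticks = sum(1 for i in range(t1, t2 + 1) if clock[i])
    return lower <= ticks <= upper
```

For the system clock (`true` at every index), the count is simply `t2 - t1 + 1`.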

Constraint validity handles timing constraints and holding patterns, but not 
escape conditions. The next two definitions handle escape conditions. Definition 12 
relates words and event sequences based on the existence of index assignments 
that may or may not invoke escape conditions. Given index assignment 
I, let $I^{-1}$ be the inverse of I (mapping windows to sets of clusters). 

Definition 9 Let V be an event sequence, W a word, and I a structurally 
valid index assignment for V and W. Let E be an escape condition of type 
“accept/reject if see X in C”. Index i into W invokes E under I if $i \in I(C)$, 
i satisfies X, and I is defined for all clusters in the images of $I^{-1}$ for windows 
occurring before i. We also say that I invokes an escape condition of V. 

Definition 10 Let V be an event sequence, W a word, and I a structurally 
valid index assignment for V and W. I loops under escape condition E if E is 
of the form “accept if don’t complete C” and I is defined for all clusters in the 
images of $I^{-1}$ for windows occurring before i, but not for a window containing i. 

For the semantics to yield a deterministic procedure for checking whether 
a word satisfies an event sequence, index assignments must assign the fewest 
and earliest possible windows to clusters (in particular, this renders both * and 
scheduling deterministic). We formally define this notion of minimality as follows: 

Definition 11 Let V be an event sequence and let W be a word. Let I and I' 
be non-equivalent index assignments for V and W. Let Rg denote the range of 
a function. $I < I'$ iff 

1. the earliest window in one but not both of Rg(I) and Rg(I') is in Rg(I), or 

2. Rg(I) = Rg(I') but for w, the earliest window such that $I^{-1}(w) \neq {I'}^{-1}(w)$, 
${I'}^{-1}(w) \subset I^{-1}(w)$. 

Given a set $\Sigma$ of index assignments, $I \in \Sigma$ is minimal in $\Sigma$ iff for all $I' \in \Sigma$, 
$I < I'$. ($<$ does not order all pairs, but is sufficient for our theorems [9].) 

We now define when a word models an event sequence: 

Definition 12 Let V be an event sequence and let W be a word. $W \models V$ if 
there exists a minimal and structurally valid index assignment I for V and W 
such that I is a total function and constraint valid, or I loops under some escape 
condition in V, or I invokes some escape condition in V. 

The semantics captures one occurrence of an event sequence, rather than 
the multiple occurrences needed to treat an event sequence as an invariant. The 
one-occurrence semantics offers two benefits: it provides a foundation for defining 
different multiple occurrence semantics [7], and it enables the mapping to weak 
automata. This restriction is not as limiting as it might seem: in prior work [8], 
we showed that relabeling fair sets and adding a few transitions constructs the 
automaton for a negated invariant event sequence (the machine most commonly 
needed for verification) from the machine that accepts one occurrence. 

4 Relationship to Existing Event Sequence Languages 

To motivate the intersection between our simultaneous goals of diagrammability 
and efficiency, this section shows how several features of existing event sequence 
languages do or do not map into the proposed intermediate language. 

4.1 Timing Diagrams 

Section 3 illustrated the connection between timing diagrams and our proposed 
event sequence language. The language presented here extends our previous re- 
sults on the relationship between timing diagrams and weak automata [8] in 
two ways. The previous result held for timing diagrams with a total order on 
their transitions and a prefix of the diagram as an environmental assumption 
(as in, “if the rising transition on a occurs, then match the whole diagram”). 
As a corollary to the results in this paper, timing diagrams with partial event 
orders and multiple non-contiguous assumptions on the environment also map 
to deterministic weak automata. We view environment assumptions as events 
that are only constrained if they occur [6]; unlike other events, their failure to 
occur does not violate the diagram’s requirements. For the diagram in Figure 2, 
we could treat the two transitions on a as environment assumptions by rewriting 
the event chain using nested clusters (as $(\{a\uparrow, b\uparrow, c\uparrow, a\downarrow\}; b\downarrow)$) and adding 
“accept-if-don’t-complete” escape conditions on the two clusters for a. 

The proposed language is more expressive than our current timing diagram 
formalization. Consider the cluster $(a\uparrow)^{*}$. The timing diagram semantics 
requires all depicted transitions to occur unless an escape condition matches, so 
this expression (without escape conditions) is currently not expressible as a timing 
diagram (since $a\uparrow$ might not occur). Similar examples involving repetitions 
also exist. Enriching the timing diagram notation could resolve some of these 
issues; this remains an issue for future work. 

4.2 LTL, Sugar, and FTL 

Sugar and FTL are similar in that each extends conventional LTL. Since there 
exist LTL formulas that cannot be captured by weak automata, certain FTL 
and Sugar formulas will not map into our intermediate language. Weakness 
primarily characterizes the location of fair sets in automata. In LTL, fairness 
constraints arise from combinations of eventualities and cycles (the operators U 
and G). Figure 4 shows automata that capture two formulas: (p U g) U r and 
p U (G(g U r)). The first example yields a weak automaton and corresponds to 
cluster ((p*; g)*; r). The second corresponds to cluster (p*; (g*; r+)*) with escape 
condition “accept if don’t complete r+”; this expression violates our syntactic 
restrictions for weakness presented in Theorem 3 (Section 5.3). 

Fig. 4. Automata for two LTL formulas. 

One key difference between these two formulas is that the second contains a 
repetition within its last cluster, while the first does not. This same difference 
characterizes the automata for the regular expressions (aa)* and (aa)*b, the 
first of which cannot be captured by a deterministic weak automaton while 
the second one can. An automaton can recognize a nonrepeating final pattern 
without creating a fair set. This motivates our characterization of weakness: the 
final cluster cannot end with an unbounded repetition marker. 

Certain other features of Sugar and FTL do not adversely impact weakness. 
FTL’s change-on and rejection constructs indicate when a sequence should be 
immediately accepted or rejected; escape conditions capture such scenarios in 
the proposed intermediate language. For example, augmenting (p U g) U r with 
escape condition “accept if see reset in g” would introduce a new state labeled 
reset with an incoming edge from the state for g; this automaton is also weak. 

4.3 OVA 

Of the recent event sequence languages discussed in this paper, OVA most closely 
matches the proposed language. Unlike Sugar and FTL, OVA does not explicitly 
support LTL or CTL operators. The OVA istrue construct maps into holding 
patterns, and their non-overlapping event clocks map into ours. Unlike the pro- 
posed language, however, OVA can express disjunction among sequences and 
negation of sequences. Our language does not support negation because negated 
sequences generally cannot be realized diagrammatically. Our language does, 
however, still support constructing deterministic weak automata for the nega- 
tions of event sequences, as outlined at the end of Section 3. 

5 Relationship to Deterministic Weak Automata 

This section characterizes which sequences in our language map to determinis- 
tic weak automata; almost all do, with the exception of those with particular 
interactions between escape conditions and repeated clusters. We construct an 
automaton corresponding to the semantics, prove the construction sound, then 
characterize when the resulting machine is both weak and deterministic. 

Fig. 5. Overview of the automaton construction algorithm. 

Given an event sequence V, we construct a Büchi automaton that accepts all 
words with a prefix that models V. Figure 5 illustrates the intuition behind the 
expansion. The construction recursively expands states corresponding to clusters 
until all states correspond to individual events. Holding patterns, escape condi- 
tions, and the ordering aspects of timing constraints are incorporated as this 
expansion proceeds. The durational aspects of timing constraints are handled in 
a final phase once all states correspond to individual events. 

Each intermediate machine during the computation abstracts the final ma- 
chine, in that if there is no edge from one abstract state to another, then there 
is no edge from any state in the expansion of the first to the expansion of the 
second in the final machine. For sake of space, we present the detailed algorithm 
only up through creating states for each event; this is sufficient for our theorems. 

The construction creates edges between abstract states based on which clus- 
ters can precede or follow other clusters; it also relies on notions of the first and 
last subclusters that could be encountered in a cluster. These concepts match 
intuition. For sake of space, we defer all but the definition of next clusters to the 
full paper [9]; examples of all four notions follow the definition. The theorem in 
Section 5.2 also refers to first and last events, which are obtained by iterating 
the first and last computations on clusters until they contain only events. 

Definition 13 Let C be a cluster immediately contained in a cluster $C'$ (if C 
has no enclosing cluster, next(C) is empty). If $C' = (C_1; \ldots; C_k)$ and $C = C_i$ 
for $i < k$, then next(C) is $\{C_{i+1}\}$ if $C_{i+1}$ is not a repeating-* cluster and 
$\{C_{i+1}\} \cup$ next($C_{i+1}$) if $C_{i+1}$ is a repeating-* cluster. If $C = C_k$, then next(C) 
is next($C'$). If C is a repeating-* cluster, next(C) also includes C. The case for 
unordered clusters unions similar results over all possible schedules, and for a 
repeated cluster $C' = C^{*}$, next(C) is $\{C\} \cup$ next($C'$). 

Examples: Given sequence $(C_1; C_2; C_3^{*}; C_4)$, next($C_2$) = next($C_3$) = $\{C_3, C_4\}$ and 
prev($C_3$) = $\{C_2, C_3\}$. Given sequence $(C_1; \{C_{21}, C_{22}\}; \{C_{31}, C_{32}\}^{*}; C_4)$ with a 
timing constraint from $C_{21}$ to $C_{22}$, next($C_{21}$) = $\{C_{22}, C_{31}, C_{32}, C_4\}$. For the first 
and last sets, first($\{C_{21}, C_{22}\}$) = $\{C_{21}\}$, last($\{C_{21}, C_{22}\}$) = $\{C_{22}\}$, and 
first($\{C_{31}, C_{32}\}$) = last($\{C_{31}, C_{32}\}$) = $\{C_{31}, C_{32}\}$. 
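
The ordered-cluster case of Definition 13 can be sketched for the simplest setting: a single top-level ordered cluster whose elements are either plain or repeating-* clusters. The list-of-pairs encoding and the name `next_clusters` are assumptions for illustration; unordered and nested clusters are not handled.

```python
def next_clusters(seq, i):
    """next() for element i of a top-level ordered cluster.

    `seq` is a list of (name, is_star) pairs; the enclosing cluster has no
    parent, so its own next set is empty.
    """
    name, star = seq[i]
    out = set()
    if star:
        out.add(name)  # a repeating-* cluster may be followed by itself
    if i + 1 < len(seq):
        succ, succ_star = seq[i + 1]
        out.add(succ)
        if succ_star:
            # a repeating-* successor may be skipped entirely
            out |= next_clusters(seq, i + 1)
    return out
```

On the first example sequence, with the third element marked as repeating, this reproduces next($C_2$) = next($C_3$) = $\{C_3, C_4\}$.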

Algorithm 1 To construct an automaton for event sequence (C, T, H, S): 

1. Create a state Final with a self loop and mark it fair. 

2. Create a state for C and mark it initial, final, and unexpanded. 

3. Repeatedly select an unexpanded state N for some non-event cluster C and 

- Add holding patterns and edges for the escape conditions for C to N. 

- Expand N according to the type of C and remove N. 

- If N was marked initial (resp. final), mark the new states for all first 
(resp. last) clusters of C initial (resp. final). Copy all other propositional 
annotations (including fair) from N to the new states from the expansion. 

4. Add an edge from each state marked final to the state Final. 



Expand Repeated Clusters. For a state for repeated cluster $C^{*}$, add an edge from 
the state for each previous cluster of $C^{*}$ to that for each next cluster of $C^{*}$. 

Expand Unordered Clusters. For a state N for unordered cluster 
$C = \{C_1, \ldots, C_k\}$: 

- For every schedule $CO_1, \ldots, CO_h$ of C, create a chain of abstract states 
$CON_1, \ldots, CON_h$. For every non-self-loop edge coming into N, add an edge 
from the same source to $CON_1$. For every non-self-loop edge leaving N, add 
an edge from $CON_h$ to the target of the original edge.^ 

- Eliminate unnecessary nondeterminism by merging states with the same 
incoming transitions and labels into single states (this shares common prefix 
states across the various permutations). 

- If N had an edge to itself, add an edge from each sink state in the subgraph 
that expands N to each source state in the subgraph that expands N. 

Handle Escape Conditions and Holding Patterns 

- For each escape condition E of the form “reject if see X in C”, create a new 
abstract state $N_E$ for E, label $N_E$ with X, add an edge from each abstract 
state corresponding to C to $N_E$, and add a self-loop at $N_E$. 

- For each escape condition E of the form “accept if see X in C”, create a new 
abstract state $N_E$ for E, label $N_E$ with X, add an edge from each abstract 
state corresponding to C to $N_E$, add a self-loop at $N_E$, and mark $N_E$ as fair 
(with a new fairness constraint). 

- For each escape condition E of the form “accept if don’t complete C”, mark 
every abstract state for C as fair (with a new fairness constraint). 

- For each holding pattern h for cluster C and each abstract state $N_C$ 
corresponding to or expanded from C, add h as a propositional label to $N_C$. 

^ To reduce the machine size, we could perform a bisimilarity minimization on the 
subgraph of all states that expanded N. 

Following Algorithm 1, all states correspond to single events but the durations 
of timing constraints have not been enforced. We handle this using an 
algorithm similar to that in our prior work [8]. For the sake of space, and since the 
expansion into events does not affect weakness or determinism by construction, 
we do not reproduce the details here. To handle the event clock $clk$ in a timing 
constraint over events $e_1$ and $e_2$, the construction adds a unique label for $clk$ to 
each state between $e_1$ and $e_2$, and creates an automaton that outputs this label 
whenever $clk$ is true. A final step cross-products the core machine with the clock 
machines; this does not affect weakness. 

The results on determinism and weakness that follow apply to those event 
sequences that end with a concrete event rather than a repetition (for reasons 
motivated in Section 4.2). We call such sequences event chains. 

Definition 14 An event sequence (C, T, H, S) is an event chain if the iterative 
expansion of last(C) contains no repeated clusters. 

5.1 Soundness 

Theorem 1. Let V be an event sequence and let M be the automaton obtained 
for V from Algorithm 1. Let W be an infinite word. M accepts W iff $W \models V$. 

Proof Sketch: Intuitively, the proof develops a correspondence between states 
in the abstract machines and the windows in the range of an index assignment 
for W and V. The theorem follows from an argument that the windows occurring 
in accepted (resp. rejected) words correspond to accepting (resp. rejecting) paths 
through the automaton. 

5.2 Characterization of Determinism 

Theorem 2. Given an event chain, Algorithm 1 produces a deterministic 
automaton if all of the following conditions are satisfied: 

- For every unordered cluster $\{C_1, \ldots, C_k\}$, the first events of each $C_i$ are 
pairwise logically inconsistent with those of each $C_j \neq C_i$ unless a timing 
constraint orders $C_i$ and $C_j$. 

- For each repeated cluster $C^{*}$, the first events of C are pairwise logically 
inconsistent with the first events of each next cluster of $C^{*}$ (other than C). 

- For each “accept/reject if see X in C” escape condition, X is logically 
inconsistent with all holding patterns for C. 

Proof Sketch: The machine is deterministic if the choice among multiple next 
states is deterministic. The construction yields multiple next states in four cases: 
possible transitions to the Final state, when choosing between schedules for an 
unordered cluster, possible skips of repeated clusters, and when invoking escape 
conditions. The restriction to event chains guarantees that states with transitions 
to Final have no other outgoing transitions. By construction, transitions into the 
states that expand clusters occur when a first event is recognized for that cluster. 
If these events are logically inconsistent, then the corresponding transitions must 
be deterministic. This covers the remaining cases. 

5.3 Characterization of Weakness 

We call a cluster C fair if there exists an escape condition of the form “accept if 
don’t complete C”. A cluster is all-fair if it is either fair or all of its sub-clusters 
are all-fair. A cluster is non-fair if neither it nor any of its sub-clusters is fair. 

Lemma 1. If an event sequence contains no all-fair repeated clusters, then the 
automaton from Algorithm 1 requires only one fair set. 

Proof Sketch: If no cycle contains states from more than one fair set, then a 
single fair set suffices. Cycles can contain states from multiple fair sets under two 
conditions. First, two “accept if don’t complete” conditions could exist for clusters 
$C_1$ and $C_2$ where $C_1$ contains $C_2$. In this case, a cycle that satisfies $C_2$ satisfies 
$C_1$, so only one fairness constraint is required. Second, a repeated cluster could 
have all sub-clusters fair, thus creating a cycle that visits each sub-cluster then 
self-loops for the repeated cluster. The lemma statement rules out this case. 

Theorem 3. Given an event chain, Algorithm 1 produces a weak automaton iff 
every repeated cluster in the chain is non-fair. 

Proof Sketch: Non-trivial strongly-connected components (SCCs) arise from 
abstract states with self-loops, which in turn arise from expanding states for 
repeated clusters. With the exception of the Final state and the states for “ac- 
cept/reject if see” escape conditions (which form their own SCCs), states are 
marked fair only if they correspond to or expand from clusters that have “ac- 
cept if don’t complete” conditions. If a repeated cluster is non-fair, then it has no 
fair SCCs embedded within self-loops (other, larger SCCs). If a repeated cluster 
is all-fair, it requires multiple fair sets and is not weak by definition. All other 
repeated clusters contain cycles with both fair and non-fair states. 
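
The fairness classification and the syntactic condition of Theorem 3 can be checked by a simple traversal. The `Cluster` class below is a hypothetical encoding; two reading choices are assumptions: "sub-clusters" in the non-fair definition is taken transitively, and a cluster without sub-clusters counts as all-fair only if it is itself fair. The sketch checks only the condition on repeated clusters, not whether the sequence is an event chain.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    name: str
    children: list = field(default_factory=list)
    repeated: bool = False
    fair: bool = False  # carries an "accept if don't complete" escape condition


def all_fair(c):
    """Fair itself, or has sub-clusters that are all all-fair."""
    if c.fair:
        return True
    return bool(c.children) and all(all_fair(k) for k in c.children)


def non_fair(c):
    """Neither the cluster nor any nested sub-cluster is fair."""
    return not c.fair and all(non_fair(k) for k in c.children)


def walk(c):
    """Yield the cluster and every cluster nested within it."""
    yield c
    for k in c.children:
        yield from walk(k)


def repeated_clusters_non_fair(root):
    """The condition of Theorem 3: every repeated cluster is non-fair."""
    return all(non_fair(c) for c in walk(root) if c.repeated)
```

A repeated cluster with one fair and one plain sub-cluster is neither non-fair nor all-fair, which matches the last case in the proof sketch (cycles with both fair and non-fair states).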

Our mapping to deterministic weak automata is not complete; in other words, 
our language does not logically characterize deterministic weak automata. Consider 
the regular expression ab* + bc*: a deterministic weak automaton accepts 
it, but it is not expressible in our language due to the use of disjunction. 



6 Related Work 

We are unaware of logical characterizations of weak automata, much less ones 
that account for diagrammability or other forms of usability. The original 
work on the efficiency of verifying weak automata is due to Bloem, Ravi and 
Somenzi [5]. Other timing diagram formalizations have supported some of the 
language extensions discussed here [2,6,12], but none related the diagrammatic 
features of these languages to efficiency in verification. 

Amla et al. ’s work on modular timing diagrams has much in common with 
this work [3]. Their work makes timing diagrams more expressive by combining 
them through non-diagrammatic operators for conjunction, iteration, and 
deterministic choice. Expressions in their language encompass several timing 
diagrams, while our work pushes the limits of a single timing diagram. Accord- 
ingly, they target efficiency through a different model of automata. The core 
differences between our works appear to be philosophical; ours focuses on un- 
derstanding the interplay between diagrammability and efficiency, while theirs 
focuses on building a practical verification framework around timing diagrams. 
The full paper provides a more detailed comparison [9] . 



7 Conclusions and Future Work 



The relationships between timing diagrams and deterministic weak automata 
suggest that there exist formal models of event sequences that simultaneously 
address both usability and efficiency. A traditional theoretical approach to de- 
signing languages towards efficiency would be to find a syntactic (logical) char- 
acterization of weak automata. This approach, however, fails to account for the 
usability of that logical characterization. This is perhaps justifiable, as “usability” 
is an inherently informal notion. If we refine our notion of usability to mean 
diagrammability, however, formal models become possible. Formal characteriza- 
tions of diagrammability usually rely on topological or spatial arguments [11]; 
appropriate characterizations for discrete linear events remain an open problem. 

The event sequence language proposed in this paper targets diagrammabil- 
ity by allowing only a restricted form of disjunction; in particular, disjunction 
governs the ordering of events, but not their occurrence. This is consistent with 
diagrams’ tendency to imply that all depicted items actually exist (maps, for 
example, indicate that all depicted features are actually there). Such nuances 
in the different uses of logical operations appear fundamental to formal models 
of diagrammability. This limited nature of disjunction also targets efficiency by 
supporting our criteria for deterministic automata. Restricted forms of iteration 
enable the mapping to weak automata. Single timing diagrams support lim- 
ited forms of iteration, and hence satisfy the criteria for weakness. Overall, the 
generality of our language substantially enriches the set of features our timing 
diagrams can support while retaining efficiency for verification. 

Several avenues remain open for future work. Given that the proposed lan- 
guage is more expressive than our current timing diagrams, characterizing dia- 
grammability is an important next problem in this project. We expect restric- 
tions on cluster nesting similar to those in timing diagrams to be key to such 
a characterization. We also plan to explore formal relationships between other 
event sequence languages and ours; this would help identify subsets of other 
languages that could be visualized and verified efficiently through a mapping to 
weak automata. Finally, many general questions remain regarding the nature of 
diagrammatic representations and their relationship to computational concerns 
such as efficiency and decidability that are so important in verification. We hope 
that our work will contribute to better understanding of these issues. 

Abstract. The Accellera Property Specification Language (PSL) is designed 
for the formal specification of hardware. The Reference Manual 
contains a formal semantics, which we previously encoded in a machine 
readable version of higher order logic. In this paper we describe how 
to ‘execute’ the formal semantics using proof scripts coded in the HOL 
theorem prover’s metalanguage ML. The goal is to see if it is feasible to 
implement useful tools that work directly from the official semantics by 
mechanised proof. Such tools will have a high assurance of conforming 
to the standard. We have implemented two experimental tools: an inter- 
preter that evaluates whether a finite trace w, which may be generated by 
a simulator, satisfies a PSL formula f (i.e. $w \models f$), and a compiler that 
converts PSL formulas to checkers in an intermediate format suitable for 
translation to HDL for inclusion in simulation test-benches. Although our 
tools use logical deduction and are thus slower than hand-crafted imple- 
mentations, they may be speedy enough for some applications. They can 
also provide a reference for more efficient implementations. 



1 Introduction 

We describe the implementation of two tools that work by applying theorem 
proving strategies to the formal semantics of the Accellera Property Specifica- 
tion Language (PSL [3]). The implementation method guarantees that the results 
are compliant with the standard. Accellera [2] is an industry consortium formed 
in 2000 by combining “Open Verilog International” and “VHDL International” . 
PSL is being developed as a standard property language for both dynamic ver- 
ification (e.g. simulation) and static verification (e.g. model checking) [8]. The 
design of PSL is based on IBM’s Sugar language. 

Previously we constructed a deep embedding of the Sugar semantics in higher 
order logic. Using the HOL theorem proving system we proved various general 
meta-theorems (see Section 2) and were able to provide some feedback and bug 
reports to the language designers [12,11]. As the semantics evolved into the 
current standard we tracked the changes and made sure that our proofs in HOL 
still went through. Our semantics is believed to correspond faithfully to the 
official formal semantics in the PSL Manual, but we cannot be completely certain 
because the official semantics is expressed in a mixture of English and LaTeX. 

Not only can theorem provers like HOL be used to prove meta-theorems, they 
can also be programmed to dynamically generate theorems for particular models 
and formulas. This provides a way of implementing tools that work deductively. 
The approach of having tools with ‘HOL Proof Inside’ has been explored by the 
Prosper project [7] and it is our goal to apply Prosper ideas to build verification 
tools that work with ‘deduction from PSL semantics inside’. This paper describes 
some preliminary experiments. 

PSL has four kinds of syntactic constructs: Boolean Expressions b, Sequential 
Extended Regular Expressions r (SEREs), Foundation Language (FL) formulas 
f and Optional Branching Extension (OBE) formulas. 

The PSL Foundation Language (FL) contains standard future-time LTL formulas 
as well as less standard formulas that are composed out of regular expressions. 
Formula {r}(f) is true if f holds at the last state of any sequence matching 
r; formula {r1} |-> {r2}! is true if every sequence matching r1 is followed by a 
sequence matching r2. FL also has abort formulas f abort b that check f but 
abort the checking if a state in which b is true is encountered, and clocking 
formulas f@c that are true when f is true of the sequence of states consisting 
of only those states for which clock c holds. 

The OBE is conventional branching time Computation Tree Logic (CTL). 
Hasan Amjad has built a symbolic model checker for OBE properties that uses 
BDD representation judgements applied to our semantics to calculate the truth- 
value of PSL properties with respect to Kripke structures. This is described 
elsewhere [4]. 

The semantics of SEREs specifies $w \models r$ to mean that a finite sequence 
of states w matches the regular expression r. The semantics of FL formulas 
specifies $w \models f$ to mean that formula f holds of a path (i.e. a finite or infinite 
sequence of states). The detailed semantics is in Section 2. PSL also has a large 
number of operators that are defined in terms of the primitives. As we shall 
illustrate, they can be added by making definitions in HOL. 

Using standard methods of semantic embedding, $w \models f$ can be viewed as a 
boolean term of higher order logic, and then automated proof by the HOL system 
can be applied. We have implemented a proof strategy to evaluate $w \models f$ where 
w is a specific finite path and f is a formula. Currently all formulas except aborts 
are covered (though a few special cases of $w \models f$ abort b can be evaluated). This 
strategy implements a tool that is useful for sanity checking that a property 
expresses what one expects: one can directly evaluate it on example paths and 
the result is guaranteed to correspond to the official semantics. Example paths 
can either be input directly as a sequence of states (a state is a set of atomic 
propositions), or can be captured from a simulation run (see Section 3.3 for 
examples) . Evaluation is fast enough to be used on simple examples and provides 
a pedagogically useful animation of the semantics. 

Our second tool, inspired by the IBM FoCs system [1], compiles a formula 
f (from a subset of PSL formulas) into a checker automaton that can be added 
to a simulation test-bench to detect when a property is violated. The checker is 
initially represented in an HDL-neutral format but can be ‘pretty printed’ into 
the syntax of particular HDLs. We have implemented a simple converter that 
generates Verilog. This provides a way of prototyping tools similar to FoCs, but 
which are guaranteed by construction to conform to the Accellera standard. Although 
generating a checker can be slow (seconds to minutes), the resulting HDL 
code can be efficient, and it is guaranteed to be equivalent to the PSL property it 
was compiled from. We think this compiler might be useful for debugging other 
property generators. Also, since the compilation is driven by symbolic execution, 
it can be tuned just by adding new theorems into the set of rules that are used. 

The rest of this paper is as follows: Section 2 describes the Accellera Property 
Specification Language (PSL) and its semantics in higher order logic; Section 3 
presents our first tool, which evaluates $w \models f$ for a given $w$ and $f$; Section 4 
presents our second tool, a checker generator. 



2 The Accellera Property Specification Language PSL 

This section describes the semantics of the linear parts of PSL (boolean expres- 
sions, SEREs, FL formulas) and is a careful manual transcription of the official 
semantics in the Language Reference Manual [3] into the machine readable logic 
supported by the HOL system. 

Boolean expressions are evaluated with respect to states. SEREs are evalu- 
ated with respect to finite sequences of states, and FL formulas with respect to 
finite or infinite sequences of states. A non-empty set P of atomic propositions 
is assumed given. A state is a subset of P, i.e. the set of propositions that are 
true in the state. If p ranges over P, then the syntax of boolean expressions b is: 

\[
\begin{array}{rcll}
b & ::= & p & \text{(Atomic proposition)} \\
  & \mid & \neg b & \text{(Negation)} \\
  & \mid & b_1 \wedge b_2 & \text{(Conjunction)}
\end{array}
\]

This is represented in higher order logic by defining a new type (using a data 
type definition mechanism), parameterised on $P$, whose elements are boolean 
expressions. The semantics of boolean expressions is specified by defining $s \models b$, 
where $s \subseteq P$, by structural induction over the type of boolean expressions: 

\[
(s \models p = p \in s) \;\wedge\; (s \models \neg b = \neg(s \models b)) \;\wedge\; (s \models b_1 \wedge b_2 = s \models b_1 \wedge s \models b_2)
\]

Here, and in what follows, the operator “$\models$” binds tightly, so that, for example, 
$s \models b_1 \wedge s \models b_2$ means $(s \models b_1) \wedge (s \models b_2)$, not $s \models (b_1 \wedge s \models b_2)$. The 
symbols $\neg$ and $\wedge$ are overloaded: the occurrence of $\neg$ in $\neg b$ is part of the boolean 
expression syntax of PSL, but the occurrence in $\neg(s \models b)$ is negation in higher 
order logic. Similarly $\wedge$ is overloaded: the occurrence in $b_1 \wedge b_2$ is part of the 
boolean expression syntax, but the other occurrences are conjunction in higher 
order logic. 
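
The structural induction above can be mirrored by a small recursive evaluator. The following is a minimal sketch in Python, not the HOL definition itself: the class names `Atom`, `Not`, `And` and the function `sat_bool` are illustrative assumptions, and a state is modelled as a Python set of proposition names.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Atom:
    name: str          # an atomic proposition p


@dataclass(frozen=True)
class Not:
    arg: "BExp"        # syntactic negation of a boolean expression


@dataclass(frozen=True)
class And:
    left: "BExp"       # syntactic conjunction of two boolean expressions
    right: "BExp"


BExp = Union[Atom, Not, And]


def sat_bool(s, b):
    """Return True iff state s (a set of proposition names) satisfies b."""
    if isinstance(b, Atom):
        return b.name in s                       # s |= p  =  p in s
    if isinstance(b, Not):
        return not sat_bool(s, b.arg)            # negation in the meta-language
    if isinstance(b, And):
        return sat_bool(s, b.left) and sat_bool(s, b.right)
    raise TypeError("not a boolean expression")
```

The two roles of negation and conjunction discussed above show up here as the difference between the constructors `Not`/`And` (object-language syntax) and Python's `not`/`and` (meta-language connectives).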

2.1 Semantics of Unclocked SEREs and FL Formulas 

In this section we do not specify the semantics of clocked SEREs r@c and for- 
mulas f@c. These are described in Section 2.2. 

The syntax of SEREs is represented in higher order logic by defining a new 
type whose elements represent SEREs. If $r$, $r_1$, $r_2$ etc. range over Sequential 
Extended Regular Expressions (SEREs) and $b$ and $c$ range over boolean expressions, 
then the syntax of SEREs is: 

\[
\begin{array}{rcll}
r & ::= & b & \text{(Boolean formula)} \\
  & \mid & \{r_1\} \mid \{r_2\} & \text{(Disjunction)} \\
  & \mid & r_1 ; r_2 & \text{(Concatenation)} \\
  & \mid & r_1 : r_2 & \text{(Fusion: overlapping concatenation)} \\
  & \mid & \{r_1\} \,\&\&\, \{r_2\} & \text{(Length matching conjunction)} \\
  & \mid & r[*] & \text{(Repeat)} \\
  & \mid & r@c & \text{(Clocking - semantics in Section 2.2)}
\end{array}
\]

The semantics of a SERE $r$ is given by specifying $w \models r$ for every finite 
sequence of states $w$. This can be read as “word $w$ is recognised by regular 
expression $r$”. 

Words are represented as lists. A list containing elements $e_0, \ldots, e_n$ is denoted 
by $[e_0; \ldots; e_n]$. Juxtaposition of words denotes concatenation (e.g. $w[s]w'$ is the 
concatenation of $w$, $[s]$ and $w'$). If $wlist$ is a list of lists then $\mathsf{Every}\ p\ wlist$ 
applies the predicate $p$ to every element of $wlist$ and returns the conjunction of 
the results (e.g. in the semantics below $\mathsf{Every}\ (\lambda w.\ w \models r)\ wlist$ asserts $w \models r$ for 
every $w$ in $wlist$) and $\mathsf{Concat}\ wlist$ denotes the concatenation of the lists in $wlist$ 
(e.g. $\mathsf{Concat}\ [[a; b]; [c]; [d; e; f]] = [a; b; c; d; e; f]$). The notation $|w|$ denotes the 
length of $w$ (empty words have length 0) and $w_i$ denotes the $i$-th element of $w$ 
counting from 0, so $w_0$ is the first element (note that subscripts on symbols not 
denoting lists are just subscripts). The input and output to HOL shown in this 
paper has been typeset using a HOL-to-LaTeX translator implemented by Keith 
Wansbrough. Applying this translator to the HOL semantics of SEREs yields: 

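\[
\begin{array}{l}
(w \models b = (|w| = 1) \wedge w_0 \models b) \;\wedge \\
(w \models r_1 ; r_2 = \exists w_1\, w_2.\ (w = w_1 w_2) \wedge w_1 \models r_1 \wedge w_2 \models r_2) \;\wedge \\
(w \models r_1 : r_2 = \exists w_1\, w_2\, l.\ (w = w_1 [l] w_2) \wedge w_1 [l] \models r_1 \wedge [l] w_2 \models r_2) \;\wedge \\
(w \models \{r_1\} \mid \{r_2\} = w \models r_1 \vee w \models r_2) \;\wedge \\
(w \models \{r_1\} \,\&\&\, \{r_2\} = w \models r_1 \wedge w \models r_2) \;\wedge \\
(w \models r[*] = \exists wlist.\ (w = \mathsf{Concat}\ wlist) \wedge \mathsf{Every}\ (\lambda w.\ w \models r)\ wlist)
\end{array}
\]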

It is hoped that this semantics requires no additional explanation. Interested 
readers can compare it to the semantics in the PSL Reference Manual [3, B.2.2.1]. 
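
For readers who prefer code to quantifiers, the unclocked SERE semantics can be animated by a naive matcher that tries every split of the word. This is only an illustrative sketch under assumptions of our own: SEREs are encoded as tagged Python tuples, boolean expressions as predicates on states, and the helper names `atom`, `TRUE` and `matches` are hypothetical. It is exponential in the worst case and is not the DFA-based method used by the tools in this paper.

```python
def atom(p):
    """SERE consisting of the atomic proposition p (a boolean expression)."""
    return ("bool", lambda s: p in s)


TRUE = ("bool", lambda s: True)   # the boolean expression T


def matches(w, r):
    """Return True iff the finite word w (a list of states) matches SERE r."""
    tag, n = r[0], len(w)
    if tag == "bool":     # |w| = 1 and w_0 |= b
        return n == 1 and r[1](w[0])
    if tag == "cat":      # w = w1 w2, w1 |= r1, w2 |= r2
        return any(matches(w[:i], r[1]) and matches(w[i:], r[2])
                   for i in range(n + 1))
    if tag == "fuse":     # w = w1 [l] w2, w1[l] |= r1, [l]w2 |= r2
        return any(matches(w[:i + 1], r[1]) and matches(w[i:], r[2])
                   for i in range(n))
    if tag == "or":
        return matches(w, r[1]) or matches(w, r[2])
    if tag == "and":      # both operands must match the same word
        return matches(w, r[1]) and matches(w, r[2])
    if tag == "star":     # w = Concat wlist with every piece matching r
        if n == 0:
            return True   # the empty wlist
        # A non-empty first piece suffices: empty pieces add nothing to Concat.
        return any(matches(w[:i], r[1]) and matches(w[i:], r)
                   for i in range(1, n + 1))
    raise ValueError("unknown SERE constructor: %r" % (tag,))
```

For example, `matches([{"a"}, {"b"}], ("cat", atom("a"), atom("b")))` holds, whereas the fusion `("fuse", atom("a"), atom("b"))` only matches words in which some single state satisfies both `a` and `b`.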

The syntax of PSL Foundation Language Formulas (FL) is given below. The 
suffix “!” found on some constructs indicates that these are ‘strong’ (i.e. 
liveness-enforcing) operators. If the corresponding weak operator (which is written 
without the “!” suffix) can be defined in terms of FL formulas, then it is not included 
in the core and is regarded as a defined operator (e.g. $\mathsf{X}\, f = \neg\mathsf{X!}\, \neg f$ and 
$f@c = \neg(\neg f@c!)$). The distinction between strong and weak operators is 
discussed and motivated in the PSL Manual [3, Section 4.4.3]. 

The syntax is represented in higher order logic by defining a new type whose 
elements are formulas. The FL primitives listed below are redundant. For 
example, $\{r_1\} \mapsto \{r_2\}!$ and $\mathsf{X!}\, f$ can be defined in terms of suffix implication. 

\[
\begin{array}{rcll}
f & ::= & p & \text{(Atomic formula)} \\
  & \mid & \neg f & \text{(Negation)} \\
  & \mid & f_1 \wedge f_2 & \text{(Conjunction)} \\
  & \mid & \mathsf{X!}\, f & \text{(Successor)} \\
  & \mid & [f_1\ \mathsf{U}\ f_2] & \text{(Until)} \\
  & \mid & \{r\}(f) & \text{(Suffix implication)} \\
  & \mid & \{r_1\} \mapsto \{r_2\}! & \text{(Strong suffix implication)} \\
  & \mid & \{r_1\} \mapsto \{r_2\} & \text{(Weak suffix implication)} \\
  & \mid & f\ \mathsf{abort}\ b & \text{(Abort)} \\
  & \mid & f@c! & \text{(Clocking - semantics in Section 2.2)}
\end{array}
\]

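Paths can be either finite or infinite. The notation $w^i$ denotes the $i$-th tail 
of $w$, i.e. the path obtained by chopping $i$ elements off the front of $w$ (so 
$w^0 = w$). The notation $w^{i,j}$ denotes the finite sequence of states from $i$ to $j$ 
in $w$, i.e. $w_i w_{i+1} \cdots w_j$. The juxtaposition $w^{i,j} w'$ denotes the path obtained by 
concatenating the finite sequence $w^{i,j}$ on to the front of the path $w'$. The HOL 
semantics of FL formulas is: 

\[
\begin{array}{l}
(w \models b = |w| > 0 \wedge w_0 \models b) \;\wedge \\
(w \models \neg f = \neg(w \models f)) \;\wedge \\
(w \models f_1 \wedge f_2 = w \models f_1 \wedge w \models f_2) \;\wedge \\
(w \models \mathsf{X!}\, f = |w| > 1 \wedge w^1 \models f) \;\wedge \\
(w \models [f_1\ \mathsf{U}\ f_2] = \exists k \in [0\,..\,|w|).\ w^k \models f_2 \wedge \forall j \in [0\,..\,k).\ w^j \models f_1) \;\wedge \\
(w \models \{r\}(f) = \forall j \in [0\,..\,|w|).\ w^{0,j} \models r \Rightarrow w^j \models f) \;\wedge \\
(w \models \{r_1\} \mapsto \{r_2\}! = \forall j \in [0\,..\,|w|).\ w^{0,j} \models r_1 \Rightarrow \exists k \in [j\,..\,|w|).\ w^{j,k} \models r_2) \;\wedge \\
(w \models \{r_1\} \mapsto \{r_2\} = \\
\quad \forall j \in [0\,..\,|w|). \\
\qquad w^{0,j} \models r_1 \Rightarrow (\exists k \in [j\,..\,|w|).\ w^{j,k} \models r_2) \vee (\forall k \in [j\,..\,|w|).\ \exists w'.\ w^{j,k} w' \models r_2)) \;\wedge \\
(w \models f\ \mathsf{abort}\ b = w \models f \vee w \models b \vee \exists j \in [1\,..\,|w|).\ \exists w'.\ w^j \models b \wedge w^{0,j-1} w' \models f)
\end{array}
\]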

This semantics is a careful formalisation of the official semantics, with the 
exception that $|w| > 0$ has been added to the definition of $w \models b$. This 
addition ensures that formulas are defined for empty paths (the official semantics is 
undefined). The semantics for non-empty paths is unchanged. 

2.2 Semantics of Clocked SEREs and FL Formulas 

SEREs and formulas not containing @ are called unclocked and the sets of 
unclocked SEREs and formulas the unclocked subsets. In the previous section 
only the semantics of the unclocked subsets were defined. This is called the 
unclocked semantics. 

Clocked SEREs have the form $r@c$ and strongly clocked formulas the form 
$f@c!$, where $c$ is a boolean expression that is true when the clock is asserted. 

Weakly clocked formulas $f@c$ are defined by $f@c = \neg((\neg f)@c!)$. Intuitively, 
$w \models r@c$ and $w \models f@c!$ mean, respectively, that $w|_c \models r$ and $w|_c \models f$, where 
$w|_c$ is obtained from $w$ by removing (‘projecting out’) all states in which $c$ is 
false (i.e. restricting $w$ to states in which $c$ is true). 
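
The projection intuition can be sketched in a few lines of Python. This is purely an illustration of the informal reading above; the function name `project` and the representation of a clock as a predicate on states are assumptions made for this sketch, and, as explained next, the formal semantics is not defined this way.

```python
def project(w, clock):
    """Keep only the states of the finite path w in which the clock is asserted."""
    return [s for s in w if clock(s)]


# Hypothetical example: sample a four-state path on the proposition "clk".
path = [{"clk", "a"}, {"a"}, {"clk"}, set()]
sampled = project(path, lambda s: "clk" in s)
```

Here `sampled` contains just the first and third states of `path`, which is the word on which an unclocked SERE or formula would then be interpreted under this reading.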

The formal semantics in the Reference Manual doesn’t use projections. 
Instead two separate semantics are given: the first one defines the semantics of 
all constructs (including clocked ones) directly, the second one provides a set 
of ‘rewrites’ that can be used to recursively eliminate all occurrences of @, 
i.e. translate into the unclocked subsets. 

The direct semantics is specified by recursively defining $w \models^c r$ and $w \models^c f$ 
for an arbitrary clock $c$, and then the semantics of a SERE $r$ and formula $f$ are 
$w \models^T r$ and $w \models^T f$, respectively, where $T$ is the top-level clock which is always 
true. The top-level semantics with a clock $c$ are $w \models^T r@c$ and $w \models^T f@c!$. 

The rewrites semantics is formalised by first defining, for each clock $c$, a 
function $\mathcal{T}^c$ that maps an arbitrary SERE or formula into the unclocked subset. 
Thus $\lambda c.\ \mathcal{T}^c$ is a function mapping a clock $c$ to a translation function which 
has $c$ as the clock context. The top-level clock is $T$, so the top-level translations 
of $r$ and $f$ are $\mathcal{T}^T(r)$ and $\mathcal{T}^T(f)$. The meanings of these can then be computed 
using the unclocked semantics in Section 2.1. 

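The definition of $\models^c$ is much more complex than the definition of $\models$ and we do 
not give it here. However, we have formalised it in higher order logic and proved 
[12,11] the sanity checking property that, if ClockFree($r$) and ClockFree($f$) mean 
that $r$ and $f$ are unclocked, then: 

\[
\begin{array}{l}
\vdash \forall r\, w.\ \mathsf{ClockFree}(r) \Rightarrow (w \models^T r = w \models r) \\
\vdash \forall f\, w.\ \mathsf{ClockFree}(f) \Rightarrow (w \models^T f = w \models f)
\end{array}
\]

We have also proved using the HOL system that: 

\[
\begin{array}{l}
\vdash \forall r\, w.\ w \models^T r = w \models \mathcal{T}^T(r) \\
\vdash \forall f\, w.\ w \models^T f = w \models \mathcal{T}^T(f)
\end{array}
\]

which allows us to evaluate the semantics of any construct by first applying these 
equations and then using the unclocked semantics. 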

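The definition of $\mathcal{T}^c(r)$ and $\mathcal{T}^c(f)$ is by recursion over the structure of SERE 
$r$ and formula $f$. For SEREs: 

\[
\begin{array}{l}
(\mathcal{T}^c(b) = (\neg c[*];\ c \wedge b)) \;\wedge \\
(\mathcal{T}^c(r_1 ; r_2) = \mathcal{T}^c(r_1) ; \mathcal{T}^c(r_2)) \;\wedge \\
(\mathcal{T}^c(r_1 : r_2) = \mathcal{T}^c(r_1) : \mathcal{T}^c(r_2)) \;\wedge \\
(\mathcal{T}^c(\{r_1\} \mid \{r_2\}) = \{\mathcal{T}^c(r_1)\} \mid \{\mathcal{T}^c(r_2)\}) \;\wedge \\
(\mathcal{T}^c(\{r_1\} \,\&\&\, \{r_2\}) = \{\mathcal{T}^c(r_1)\} \,\&\&\, \{\mathcal{T}^c(r_2)\}) \;\wedge \\
(\mathcal{T}^c(r[*]) = \mathcal{T}^c(r)[*]) \;\wedge \\
(\mathcal{T}^c(r@c_1) = (\neg c_1[*];\ c_1 : \mathcal{T}^{c_1}(r)))
\end{array}
\]

and for formulas: 

\[
\begin{array}{l}
(\mathcal{T}^c(b) = b) \;\wedge \\
(\mathcal{T}^c(\neg f) = \neg\mathcal{T}^c(f)) \;\wedge \\
(\mathcal{T}^c(f_1 \wedge f_2) = \mathcal{T}^c(f_1) \wedge \mathcal{T}^c(f_2)) \;\wedge \\
(\mathcal{T}^c(\mathsf{X!}\, f) = \mathsf{X!}\,([\neg c\ \mathsf{U}\ (c \wedge \mathcal{T}^c(f))])) \;\wedge \\
(\mathcal{T}^c([f_1\ \mathsf{U}\ f_2]) = [(c \Rightarrow \mathcal{T}^c(f_1))\ \mathsf{U}\ (c \wedge \mathcal{T}^c(f_2))]) \;\wedge \\
(\mathcal{T}^c(\{r\}(f)) = \{\mathcal{T}^c(r)\}([\neg c\ \mathsf{U}\ (c \wedge \mathcal{T}^c(f))])) \;\wedge \\
(\mathcal{T}^c(\{r_1\} \mapsto \{r_2\}!) = \{\mathcal{T}^c(r_1)\} \mapsto \{\mathcal{T}^c(r_2)\}!) \;\wedge \\
(\mathcal{T}^c(\{r_1\} \mapsto \{r_2\}) = \{\mathcal{T}^c(r_1)\} \mapsto \{\mathcal{T}^c(r_2)\}) \;\wedge \\
(\mathcal{T}^c(f\ \mathsf{abort}\ b) = \mathcal{T}^c(f)\ \mathsf{abort}\ (c \wedge b)) \;\wedge \\
(\mathcal{T}^c(f@c_1!) = [\neg c_1\ \mathsf{U}\ (c_1 \wedge \mathcal{T}^{c_1}(f))])
\end{array}
\]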



3 Executing the Formal Semantics 

The HOL system has an ML function EVAL [5] which when applied to a term $t$ 
proves a theorem $\vdash t = t'$, where $t'$ is the result of evaluating $t$. EVAL performs 
call-by-value order rewriting efficiently using logic definitions that are in force 
in the context in which it is invoked. It can also invoke equations and decision 
procedures that have been explicitly added to the context. 



3.1 Executing the Clock Removal Rewrites 

The semantics of a formula $f$ with respect to a path $w$ is $w \models^T f$. The first step 
in evaluating $w \models^T f$ is to rewrite with the equations: 

\[
\vdash \forall r\, w.\ w \models^T r = w \models \mathcal{T}^T(r) \quad\text{and}\quad \vdash \forall f\, w.\ w \models^T f = w \models \mathcal{T}^T(f)
\]

The next step is to execute the definition of $\mathcal{T}^T$, and the final step is to evaluate 
the unclocked semantics (Section 3.2). 

The clocking removal rewrites are directly executable, but the results are 
complicated. For example EVAL $\mathcal{T}^c(T[*];\ \{\neg rq@c_1\} \,\&\&\, \{ak@c_2\};\ rq@c_1)$ 
evaluates to the almost completely incomprehensible theorem: 

\[
\begin{array}{l}
\vdash \mathcal{T}^c(T[*];\ \{\neg rq@c_1\} \,\&\&\, \{ak@c_2\};\ rq@c_1) = \\
\quad \{\neg c[*];\ c \wedge T\}[*];\ \{\neg c_1[*];\ c_1 : \neg c_1[*];\ c_1 \wedge \neg rq\} \,\&\&\, \{\neg c_2[*]; \\
\quad c_2 : \neg c_2[*];\ c_2 \wedge ak\};\ \neg c_1[*];\ c_1 : \neg c_1[*];\ c_1 \wedge rq
\end{array}
\]

This illustrates how much more natural and high-level are properties expressed 
using the @c clocking construct. Note also that $c_1 : \neg c_1[*]$ is equivalent to 
$c_1$, which shows the need to perform peephole optimisations on the output of 
naive evaluation with the rewrites. Executing the rewrites for formulas typically 
produces even more incomprehensible results than with SEREs! For example, 
consider the following (the operator before is defined in Section 3.3): 

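\[
\begin{array}{l}
\vdash \mathcal{T}^c(\{T[*];\ \neg ak_1;\ ak_1;\ T\}(\neg ak_2 \wedge \mathsf{X!}(ak_2)\ \mathsf{before}\ \neg ak_1 \wedge \mathsf{X!}(ak_1))) = \\
\quad \{\{\neg c[*];\ c \wedge T\}[*];\ \neg c[*];\ c \wedge \neg ak_1;\ \neg c[*];\ c \wedge ak_1; \\
\quad \neg c[*];\ c \wedge T\}([\neg c\ \mathsf{U}\ c \wedge \neg(\neg[\neg(\neg\neg c \wedge \neg\neg(\neg ak_1 \wedge \mathsf{X!}([\neg c\ \mathsf{U}\ c \wedge ak_1])))\ \mathsf{U}\ c \wedge (\neg ak_2 \wedge \\
\quad \mathsf{X!}([\neg c\ \mathsf{U}\ c \wedge ak_2])) \wedge \neg(\neg ak_1 \wedge \mathsf{X!}([\neg c\ \mathsf{U}\ c \wedge ak_1]))] \wedge \neg\neg[\neg(\neg\neg c \wedge \neg T)\ \mathsf{U}\ c \wedge \\
\quad \neg\neg(\neg ak_1 \wedge \mathsf{X!}([\neg c\ \mathsf{U}\ c \wedge ak_1]))])])
\end{array}
\]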

Just looking at this suggests that boolean simplifications should be applied to the 
result of naive evaluation. Simple evaluation like this can provide a useful tool 
development aid, as concrete examples may provide insight into the semantics of 
clocking that is not immediately apparent from the general semantic definitions. 

3.2 Executing the Unclocked Formula Semantics 

In some cases $w \models f$ can be executed directly; in other cases we first have to 
transform it into a different form. 

Boolean Expressions 

The semantics of boolean expressions can be directly evaluated. For example, 
$[s]w \models a \wedge b$ evaluates to $a \in s \wedge b \in s$ (if $s$ were an explicit set rather than a 
variable then EVAL could reduce this further). 

Negations $\neg f$, Conjunctions $f_1 \wedge f_2$, and Next-State $\mathsf{X!}\, f$ 

To evaluate formulas, first note that $w \models \neg f$, $w \models f_1 \wedge f_2$ and $w \models \mathsf{X!}\, f$ can 
be rewritten directly using the semantics. For example, here are the results of 
invoking EVAL on $p \wedge \mathsf{X!}\, f$ with three increasingly specific paths (in each case 
EVAL is applied to the term on the left hand side of the equation, and generates 
a theorem showing the evaluation of this term): 

\[
\begin{array}{l}
\vdash w \models p \wedge \mathsf{X!}\, f = (|w| > 0 \wedge w_0 \models p) \wedge |w| > 1 \wedge (w^1) \models f \\
\vdash ([s_0]w) \models p \wedge \mathsf{X!}\, f = s_0 \models p \wedge |w| + 1 > 1 \wedge w \models f \\
\vdash s_0 s_1 s_2 s_3 s_4 s_5 s_6 s_7 s_8 s_9 \models p \wedge \mathsf{X!}\, f = s_0 \models p \wedge s_1 s_2 s_3 s_4 s_5 s_6 s_7 s_8 s_9 \models f
\end{array}
\]

These illustrate symbolic evaluation: when laws apply they are used to reduce a 
term, but if no laws are applicable then the term is left unevaluated. For instance, 
$|w| + 1 > 0$ can be evaluated, since EVAL has been told $\vdash \forall n.\ n + 1 > 0 = T$, but 
$|w| + 1 > 1$ cannot be evaluated for an arbitrary variable $w$. The more specific term 
$|s_0 s_1 s_2 s_3 s_4 s_5 s_6 s_7 s_8 s_9| + 1 > 1$ can be evaluated, even though the states $s_i$ are 
left as variables, since the path has length 10, which is greater than 0. With 
a fully concrete path the truth of the formula is completely determined. To 
display concrete examples we write $\{\ldots\}\{\ldots\}\cdots\{\ldots\} \models f$ where $\{\ldots\}$ are sets 
of atomic propositions representing states. Note that in such examples braces 
are set brackets, not part of SERE syntax. For example: 

\[
\vdash \{a\}\{a, b\}\{b\} \models a \wedge \mathsf{X!}(b) = T
\]

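Until Formulas $[f_1\ \mathsf{U}\ f_2]$ 

The semantics of the until-construct is: 

\[
w \models [f_1\ \mathsf{U}\ f_2] = \exists k \in [0\,..\,|w|).\ w^k \models f_2 \wedge \forall j \in [0\,..\,k).\ w^j \models f_1
\]

which cannot be directly executed, but there is a standard recursive version of 
this definition that can easily be proved as a theorem and is directly executable: 

\[
\vdash w \models [f_1\ \mathsf{U}\ f_2] = |w| > 0 \wedge (w \models f_2 \vee w \models f_1 \wedge w^1 \models [f_1\ \mathsf{U}\ f_2])
\]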
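
The relationship between the quantified definition and its recursive form can be checked on finite paths with a short script. The sketch below is an assumption-laden illustration rather than the HOL proof: formulas are modelled as Python predicates on paths (lists of states), and the names `atom`, `until_direct` and `until_rec` are hypothetical.

```python
from itertools import product


def atom(p):
    """Formula that holds of a non-empty path whose first state contains p."""
    return lambda w: len(w) > 0 and p in w[0]


def until_direct(w, f1, f2):
    """exists k in [0..|w|). w^k |= f2 and forall j in [0..k). w^j |= f1"""
    return any(f2(w[k:]) and all(f1(w[j:]) for j in range(k))
               for k in range(len(w)))


def until_rec(w, f1, f2):
    """|w| > 0 and (w |= f2 or (w |= f1 and w^1 |= [f1 U f2]))"""
    return len(w) > 0 and (f2(w) or (f1(w) and until_rec(w[1:], f1, f2)))


def agree_up_to(max_len):
    """Compare both versions of [a U b] on every path of length <= max_len."""
    states = [set(), {"a"}, {"b"}, {"a", "b"}]
    a, b = atom("a"), atom("b")
    return all(until_direct(list(w), a, b) == until_rec(list(w), a, b)
               for n in range(max_len + 1)
               for w in product(states, repeat=n))
```

Exhaustive comparison on short paths, as in `agree_up_to(4)`, is of course only a sanity check on finite instances; in HOL the equivalence is a proved theorem.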

The following example is from the Reference Manual [3, Example 2, page 45]. 

time  0 1 2 3 4 5 6 7 8 9 

clk1  0 1 0 1 0 1 0 1 0 1 
a     0 0 0 1 1 1 0 0 0 0 
b     0 0 0 0 0 1 0 1 1 0 
c     1 0 0 0 0 1 1 0 0 0 
clk2  1 0 0 1 0 0 1 0 0 1 

Define w1 to be this path, namely: 

w1 = {c,clk2}{clk1}{}{clk1,a,clk2}{a}{clk1,a,b,c}{c,clk2}{clk1,b}{b}{clk1,clk2} 

Recall that weak clocking is defined by: $f@c = \neg(\neg f@c!)$. After making this 
definition we can evaluate examples like $(w1^i) \models^T (c \wedge \mathsf{X!}\,([a\ \mathsf{U}\ b]))@clk1$ and 
$(w1^i) \models^T (c \wedge \mathsf{X!}\,(([a\ \mathsf{U}\ b])@clk1))@clk2$ for $0 \le i < |w1|$ and confirm that the 
first is true only when $i$ is 4 or 5, and the second only when $i$ is 0. The semantics 
of multiple clocking is subtle (clocks do not accumulate: an inner clock ignores 
an outer one) and is still under discussion and may change. Our tools facilitate 
experiments on concrete scenarios to gain insight into the current semantics. 

Suffix Implication $\{r\}(f)$ 

Suffix implication formulas $\{r\}(f)$ are executed by generating a matcher for $r$ 
and then invoking EVAL on $f$ whenever a match is found. In detail, the SERE 
$r$ is first lifted to an element of a HOL theory of regular expressions (based on 
Nipkow’s Isabelle work [13], but with many details adjusted for PSL), and then 
a proof procedure lazily constructs the state set, accepting states and transition 
table of an equivalent DFA. This DFA is run along a finite trace $w$, and whenever 
it enters an accepting state EVAL is used to check $x \models f$ on the remaining trace $x$. 
The constants that do this lifting (sere2regexp) and DFA execution (acheck) are 
defined to be executed efficiently in the logic, but the following theorem shows 
that they also preserve the semantics of the original suffix implication formula: 

\[
\vdash \forall w\, r\, f.\ \mathsf{ClockFree}(r) \Rightarrow (w \models \{r\}(f) = \mathsf{acheck}\ (\mathsf{sere2regexp}\ r)\ (\lambda x.\ x \models f)\ w)
\]

Strong Suffix Implications $\{r_1\} \mapsto \{r_2\}!$ 

Strong implications $\{r_1\} \mapsto \{r_2\}!$ are reduced to suffix implications by:¹ 

\[
\vdash \forall w\, r_1\, r_2.\ w \models \{r_1\} \mapsto \{r_2\}! = w \models \{r_1\}(\neg\{r_2\}(F))
\]

Weak Suffix Implications $\{r_1\} \mapsto \{r_2\}$ 

Weak implications are executed by, if necessary, performing a reachability 
calculation inside HOL. We add a Prefix operator² to the HOL regular expression 
theory, with the semantics that Prefix($r$) matches a word $w$ if it can be 
extended by $w'$ such that $r$ matches $ww'$. We can now use our generic lifting and 
DFA execution constants to execute weak implication, and the following theorem 
guarantees that the semantics of the original formula are preserved: 

\[
\begin{array}{l}
\vdash \forall w\, r_1\, r_2. \\
\quad \mathsf{ClockFree}(r_1) \wedge \mathsf{ClockFree}(r_2) \Rightarrow \\
\quad (w \models \{r_1\} \mapsto \{r_2\} = \\
\qquad \mathsf{acheck}\ (\mathsf{sere2regexp}\ r_1) \\
\qquad (\lambda x.\ x \models \neg\{r_2\}(F) \vee \mathsf{amatch}\ (\mathsf{Prefix}\ (\mathsf{sere2regexp}\ r_2))\ x)\ w)
\end{array}
\]

¹ This equivalence was first observed by Dana Fisman (private communication). 

² The prefix operators used for weak implication (Prefix) and abort (FormPrefix) are 
based on an idea from Dana Fisman (private communication). 

The amatch constant checks whether a regular expression matches a word, by 
building an equivalent DFA, executing it along the word, and testing whether it 
is in an accepting state at the end. If the regular expression is Prefix(r), then the 
state s is accepting precisely when it is possible to reach an accepting state from 
s on the transition graph of (the DFA corresponding to) r. To implement this, we 
defined a version of Dijkstra’s reachability algorithm, and proved it correct [6]. 
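
The reachability idea behind Prefix can be sketched outside the logic as follows. This is an illustrative stand-in, not the verified algorithm of [6]: it uses a plain backward breadth-first search instead of Dijkstra's algorithm, and the function name `prefix_accepting` and the dictionary encoding of the transition graph are assumptions of this sketch.

```python
from collections import deque


def prefix_accepting(transitions, accepting):
    """Return the states from which some accepting state is reachable.

    transitions maps each state to an iterable of its successor states
    (the input symbols are irrelevant for reachability); accepting is the
    set of accepting states of the DFA for r.  The result is the set of
    accepting states of the DFA for Prefix(r).
    """
    # Build the reversed graph once.
    preds = {}
    for s, succs in transitions.items():
        for t in succs:
            preds.setdefault(t, set()).add(s)
    # Search backwards from the accepting states.
    seen = set(accepting)
    queue = deque(accepting)
    while queue:
        t = queue.popleft()
        for s in preds.get(t, ()):
            if s not in seen:
                seen.add(s)
                queue.append(s)
    return seen
```

In a hypothetical DFA with transitions `{0: [1, 2], 1: [1], 2: [3], 3: [3], 4: [0]}` and accepting set `{3}`, state 1 is a non-accepting sink, so every state except 1 becomes accepting for the prefix automaton.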

To summarise, we execute $w \models \{r_1\} \mapsto \{r_2\}$ solely by deductions in the 
logical kernel. We first use the above theorem to reduce the problem to 
executing a DFA. This involves performing many on-the-fly deductions to evaluate 
transitions and accepting states. The Prefix operator is the most complex of 
these on-the-fly deductions, requiring a reachability calculation on the 
transition graph. This reachability calculation can be reduced to an instance of 
Dijkstra’s algorithm, but to make that step we need the correctness proof of the 
algorithm. The end result of all this deduction is a HOL theorem of the form 
$\vdash (w \models \{r_1\} \mapsto \{r_2\}) = b$, where $b$ is either $T$, $F$, or something more complex if 
the original term contained variables. 

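Aborts $f\ \mathsf{abort}\ b$ 

We currently do not have a fully general method of executing $w \models f\ \mathsf{abort}\ b$, 
but evaluation in some cases is possible. First define a formula prefix function 
FormPrefix and an auxiliary function AbortAux: 

\[
\begin{array}{l}
\mathsf{FormPrefix}\ w\ f = \exists w'.\ w w' \models f \\
\mathsf{AbortAux}\ w\ f\ b\ n = \exists j \in [n\,..\,|w|).\ w^j \models b \wedge \mathsf{FormPrefix}\ (w^{0,j-1})\ f
\end{array}
\]

then it is easy to prove: 

\[
\begin{array}{l}
\vdash w \models f\ \mathsf{abort}\ b = w \models f \vee w \models b \vee \mathsf{AbortAux}\ w\ f\ b\ 1 \\
\vdash \mathsf{AbortAux}\ w\ f\ b\ n = \\
\quad n < |w| \wedge (w^n \models b \wedge \mathsf{FormPrefix}\ (w^{0,n-1})\ f \vee \mathsf{AbortAux}\ w\ f\ b\ (n + 1))
\end{array}
\]

and adding these to the rewrites used by EVAL enables $f\ \mathsf{abort}\ b$ to be executed 
in the trivial cases when $w \models f$ or $w \models b$ evaluate to true. For a non-trivial 
concrete example, consider the following (cf. [14, Fig. 8, page 22]): 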



time       00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 

start       0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0 
req         0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0 
ack         0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 
interrupt   0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0 

This corresponds to the finite path w2 where 

w2 = {}{start}{}{}{req}{}{}{}{}{}{interrupt}{}{}{}{}{} 

If we define: 

\[
\begin{array}{ll}
\forall f.\ \mathsf{F}\, f = [T\ \mathsf{U}\ f], & \forall f.\ \mathsf{eventually!}\ f = \mathsf{F}\, f \\
\forall f.\ \mathsf{G}\, f = \neg(\mathsf{F}\,(\neg f)), & \forall f.\ \mathsf{always}\ f = \mathsf{G}\, f
\end{array}
\]

then EVAL will prove: 

\[
\begin{array}{l}
\vdash w2 \models \mathsf{always}(start \rightarrow (\mathsf{always}(req \rightarrow \mathsf{eventually!}\ ack)\ \mathsf{abort}\ interrupt)) = \\
\quad \mathsf{FormPrefix}\ (\{start\}\{\}\{\}\{req\}\{\}\{\}\{\}\{\}\{\})\ (\neg[T\ \mathsf{U}\ \neg\neg(\neg\neg req \wedge \neg[T\ \mathsf{U}\ ack])])
\end{array}
\]

The right hand side of this equation is true if path $\{start\}\{\}\{\}\{req\}\{\}\{\}\{\}\{\}\{\}$ 
can be extended to make $(\neg[T\ \mathsf{U}\ \neg\neg(\neg\neg req \wedge \neg[T\ \mathsf{U}\ ack])])$ true. In this particular 
case it is sufficient to only consider extensions either by the empty path or by 
a singleton path consisting of one state of the form $\{x\}$. The following easily 
proved theorem says that $w \models f$ or $\exists x.\ (w[\{x\}]) \models f$ is sufficient: 

\[
\vdash \mathsf{FormPrefix}\ w\ f = w \models f \vee (\exists x.\ (w[\{x\}]) \models f) \vee \mathsf{FormPrefix}\ w\ f
\]

For the example above, adding this to the equations used by EVAL results in a 
term $\exists x.\ \neg(\neg(ack = x) \vee (req = x) \wedge \neg(ack = x))$ being generated. EVAL can 
be programmed to invoke a decision procedure on such terms. This is an ad hoc 
partial solution. We hope eventually to make our implementation complete. 

3.3 More Examples 

The first example below illustrates the utility of having an automatic semantics 
calculator. The second example shows how such a calculator can be used to 
analyse behaviours captured from simulation. 

3.3.1 An Example from an Accellera Online Discussion 

A recent online discussion [9,10] concerned the intervals for which the SERE 
$((a; b)@clk1;\ c)@clk2$ “holds tightly” within the behaviour: 

time  0 1 2 3 4 5 6 7 

clk1  0 1 0 1 0 1 0 1 
a     0 1 1 0 0 0 0 0 
b     0 0 0 1 0 0 0 0 
c     0 0 0 0 1 0 1 0 
clk2  1 0 0 1 0 0 1 0 

The Reference Manual introduces the terminology $r$ “holds tightly” for $w$ if and 
only if $w \models^T r$. To understand this example, note that clocks don’t accumulate: 
only the current, i.e. innermost, one is used to sample the path. To analyse this 
example, a simple ML function can easily be written that evaluates a SERE on 
all sub-intervals of a path and returns the results that correspond to intervals 
for which the SERE holds. Using this we can analyse the example and deduce 
that the only interval where the SERE holds tightly is: 

\[
\vdash \{clk2\}\{clk1, a\}\{a\}\{clk1, b, clk2\}\{c\}\{clk1\}\{c, clk2\} \models^T ((a; b)@clk1;\ c)@clk2 = T
\]

This resolves the discussion in favour of the Manual ([10] is correct, [9] is wrong). 
Note that the other examples in the Manual can be (and have been) automat- 
ically checked. The fact that there is ongoing discussion about properties as 
simple as this suggests that our semantics calculator might be a useful tool for 
property writers. 

3.3.2 An Example from the FoCs Manual 

Evaluation in HOL is nearly instantaneous for examples of the scale above. 
Whilst we would not claim our evaluator can handle ‘industrial scale’ problems, 
it can be applied to significantly more complex examples. In the IBM FoCs 
Manual there is a Sender-Buffer-Receiver example in which the Sender (S) communicates with 
the Buffer (B) using a four-phase handshake with request signal StoB_REQ and 
acknowledgement BtoS_ACK, and the Buffer communicates with the Receiver 
(R) using a four-phase handshake with request signal BtoR_REQ and 
acknowledgement RtoB_ACK. 

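We can define in HOL a function FourPhase such that FourPhase req ack is 
true if signals req and ack satisfy properties required of a four-phase handshake. 
First define: 

\[
\forall r.\ \mathsf{never}(r) = \{T[*];\ r\} \mapsto \{F\}
\]

then define: 

\[
\begin{array}{l}
\mathsf{FourPhase}\ req\ ack = \\
\quad \mathsf{never}(T[*];\ \neg req \wedge ack;\ req) \wedge \mathsf{never}(T[*];\ req \wedge \neg ack;\ \neg req) \;\wedge \\
\quad \mathsf{never}(T[*];\ \neg ack \wedge \neg req;\ ack) \wedge \mathsf{never}(T[*];\ ack \wedge req;\ \neg ack)
\end{array}
\]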

Definitions like FourPhase in HOL are analogous to definitions of verification 
units (vunits) in PSL. 
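
To make the four handshake constraints concrete, here is a direct Python check of the same conditions on a finite trace. It is a sketch under simplifying assumptions, not the HOL evaluator: it handles only the two-state SEREs `b1; b2` that occur in FourPhase (for which never amounts to "no two consecutive states satisfy b1 then b2"), states are sets of signal names, and the function names `never_pair` and `four_phase` are hypothetical.

```python
def never_pair(trace, first, second):
    """never(b1; b2): no consecutive states satisfy `first` then `second`."""
    return not any(first(s) and second(t) for s, t in zip(trace, trace[1:]))


def four_phase(trace, req, ack):
    """Check the four FourPhase constraints for signals named req and ack."""
    r = lambda s: req in s
    a = lambda s: ack in s
    return (never_pair(trace, lambda s: not r(s) and a(s), r)
            and never_pair(trace, lambda s: r(s) and not a(s), lambda s: not r(s))
            and never_pair(trace, lambda s: not a(s) and not r(s), a)
            and never_pair(trace, lambda s: a(s) and r(s), lambda s: not a(s)))


# One complete handshake: request up, acknowledge up, request down, acknowledge down.
good = [set(), {"req"}, {"req", "ack"}, {"ack"}, set()]
# The request is withdrawn before it has been acknowledged.
bad = [set(), {"req"}, set()]
```

On these hypothetical traces `four_phase(good, "req", "ack")` holds and `four_phase(bad, "req", "ack")` does not, since `bad` violates the second constraint.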

We have written a Verilog model to generate paths. If SimRun is a 700 
state Verilog generated path, our tool currently takes about a couple of minutes 
on a 1GHz machine to evaluate $SimRun \models^T \mathsf{FourPhase}\ StoB\_REQ\ BtoS\_ACK$ 
and $SimRun \models^T \mathsf{FourPhase}\ BtoR\_REQ\ RtoB\_ACK$. Notice that both never and 
FourPhase have an initial $T[*]$. If we remove the occurrences of $T[*]$ in FourPhase 
then the checking is more than twice as fast. If we augmented the rewrites used 
by EVAL to include: 

\[
\begin{array}{l}
\vdash \forall w\, r_1\, r_2\, r_3.\ w \models (r_1; r_2); r_3 = w \models r_1; (r_2; r_3) \\
\vdash \forall w\, r.\ w \models r[*]; r[*] = w \models r[*]
\end{array}
\]

then this optimisation could be made to happen automatically. 

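If, using the definition of $\mathsf{G}\, f$ given earlier, we define: 

\[
[f_1\ \mathsf{W}\ f_2] = [f_1\ \mathsf{U}\ f_2] \vee \mathsf{G}\, f_1, \qquad f_1\ \mathsf{before}\ f_2 = [\neg f_2\ \mathsf{W}\ (f_1 \wedge \neg f_2)]
\]

then Ackinterleave $ack_1$ $ack_2$ defined below states that $ack_2$ is asserted between 
any two $ack_1$ assertions: 

\[
\begin{array}{l}
\mathsf{Ackinterleave}\ ack_1\ ack_2 = \\
\quad \{T[*];\ \neg ack_1;\ ack_1\}(\neg ack_2 \wedge \mathsf{X!}\,(ack_2)\ \mathsf{before}\ \neg ack_1 \wedge \mathsf{X!}\,(ack_1))
\end{array}
\]

Checking that the conjunction below evaluates to $T$ takes about 5 minutes. 

\[
\begin{array}{l}
SimRun \models^T \mathsf{Ackinterleave}\ BtoS\_ACK\ RtoB\_ACK \;\wedge \\
SimRun \models^T \mathsf{Ackinterleave}\ RtoB\_ACK\ BtoS\_ACK
\end{array}
\]

This corresponds to the vunit ack_interleaving in the FoCs Manual example. 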







M. Gordon, J. Hurd, and K. Slind 



4 Compiling the Formal Semantics 

In the last section we saw how to execute the formal semantics by deduction 
in the theorem prover. In particular, SEREs are executed by constructing a 
provably equivalent DFA. In the same way, some PSL formulas are equivalent 
to DFAs, where a violation of the formula corresponds to the DFA entering an 
accepting state. In this section, we show how to safely compile a subset of such 
PSL formulas as ‘checker modules’ in a HDL. An off-the-shelf simulation tool is 
then used to simulate the circuit together with the checker, and any violations 
of the property are detected and reported to the user. 

To illustrate the operation of the compiler, we will use part of the FourPhase 
property (introduced in Section 3.3): 

\[
\mathsf{never}(\neg StoB\_REQ \wedge BtoS\_ACK;\ StoB\_REQ)
\]

This says that whenever StoB_REQ is low and BtoS_ACK is high, it is never the 
case that StoB_REQ will go high before BtoS_ACK goes low. By the definition 
of never (also in Section 3.3), this property holds if and only if the following 
SERE does not hold for any initial segment of the trace: 

\[
T[*];\ \neg StoB\_REQ \wedge BtoS\_ACK;\ StoB\_REQ
\]

If we convert this SERE to an equivalent DFA, it is easy to check whether it 
accepts any initial segment of a trace. We simply advance the DFA along the 
trace according to its transition function, and if it ever reaches an accepting 
state we report that the never property has been violated. 

To summarise, compiling the property $\mathsf{never}(r)$ reduces to generating an 
equivalent DFA to the SERE $T[*]; r$, and replacing accepting states with an 
error message reporting that the property has been violated. 
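
The run-and-report loop just described can be prototyped in software before committing to an HDL. The sketch below hand-codes, in Python, the four-state transition function of the Verilog checker shown in Fig. 1 for the example property; the function names `step` and `first_violation` are assumptions of this sketch, and nothing here is generated or verified by HOL.

```python
def step(state, s):
    """One transition of the checker DFA on state s (a set of signal names)."""
    req = "StoB_REQ" in s
    ack = "BtoS_ACK" in s
    if state in (0, 1):          # no pending (not req and ack) observation
        return 1 if req else (2 if ack else 1)
    if state == 2:               # previous state had req low and ack high
        return 3 if req else (2 if ack else 1)
    return 3                     # state 3 is the accepting (error) state


def first_violation(trace):
    """Return the index at which the property is first violated, or None."""
    state = 0
    for i, s in enumerate(trace):
        state = step(state, s)
        if state == 3:
            return i
    return None
```

For the hypothetical trace `[set(), {"BtoS_ACK"}, {"StoB_REQ"}]` the request rises immediately after a state with the request low and the acknowledgement high, so `first_violation` returns 2; on a well-behaved handshake it returns `None`.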

Let us now look more closely at the compilation process, to see how the 
semantics are preserved. We begin with the PSL formula $\mathsf{never}(r)$. We convert 
the SERE $T[*]; r$ to an element of the HOL regular expression theory, and then 
to a DFA with a set of states, a subset of accepting states and a transition 
table. We intend to simulate this DFA concurrently with a circuit, and report 
an error whenever the DFA enters an accepting state. We can consider the circuit 
simulation to be producing an infinite trace, and the DFA is effectively run on all 
initial segments of this. The following theorem shows that this mode of operation 
preserves the semantics of the original PSL formula $\mathsf{never}(r)$: 

\[
\begin{array}{l}
\vdash \forall r\, w.\ \mathsf{ClockFree}(r) \wedge (|w| = \infty) \Rightarrow \\
\quad (w \models \mathsf{never}(r) = \forall n.\ \neg\mathsf{amatch}\ (\mathsf{sere2regexp}\ (T[*]; r))\ (w^{0,n}))
\end{array}
\]

The next step is the extraction of the DFA from HOL to an ML data-type, 
ready for a compiler back end to output code for a particular HDL. We use proof 
as much as possible in this function, because it increases our confidence in the 
correctness of the extracted DFA while incurring relatively little cost. The ML 
function that performs the extraction takes as input a list of atomic propositions 
and a regular expression, and returns for each reachable state of the DFA: (i) 
an integer state identifier, and the HOL term that represents the state, (ii) a 
boolean that is true for accepting states, and a HOL theorem proving this, and 
(iii) a ‘condition’ data-type that encodes a series of tests on the truth value of 
atomic propositions followed by a transition to a new state, with HOL theorems 
proving the conditional transitions correct. 

Shown below is the ML output from applying the DFA extraction function 
to our example: R := sere2regexp ($T[*];\ \neg StoB\_REQ \wedge BtoS\_ACK;\ StoB\_REQ$). 

[(0, `[6]`, (false, |- eval_accepts R [6] = F), 
  Branch("StoB_REQ", 
    Leaf (1, |- !s. StoB_REQ IN s ==> (eval_transitions R [6] s = [4])), 
    Branch("BtoS_ACK", 
      Leaf (2, |- !s. 
                 ~(StoB_REQ IN s) /\ BtoS_ACK IN s ==> 
                 (eval_transitions R [6] s = [2; 4])), 
      Leaf (1, |- !s. 
                 ~(StoB_REQ IN s) /\ ~(BtoS_ACK IN s) ==> 
                 (eval_transitions R [6] s = [4]))))), 
 (1, `[4]`, (false, |- eval_accepts R [4] = F), ...), 
 (2, `[2; 4]`, (false, |- eval_accepts R [2; 4] = F), ...), 
 (3, `[0; 4]`, (true, |- eval_accepts R [0; 4] = T), ...)] 

For reasons of space only the transition function for state 0 (the initial state) is 
shown. The term representing this state is [6]³, the false indicates that this 
state is not accepting, and is followed by a theorem proving this. 

The condition first tests the atomic proposition StoB_REQ, and if true moves 
to state 1 (which as we see is represented in HOL as [4]). The conditional 
theorem at this leaf reflects this transition. 

From this language independent description of a DFA, it is a simple matter 
to generate versions in a HDL. We have implemented a pretty-printer for Verilog 
syntax. The resulting Verilog module for our example property is shown in Fig. 1, 
and it has correctly reported errors during simulations of a buggy buffer circuit. 



5 Conclusions and Future Work 

The main point of this paper is that a formal semantics is not just documen- 
tation. Current theorem provers are powerful enough to be programmed to ex- 
ecute semantics in interesting ways, though a major challenge is to engineer 
the deductions to be fast enough to be useful. We have illustrated this with 
two prototype tools. The first one could be useful for property developers and 
teachers and learners of PSL. The second one illustrates a novel way of imple- 
menting an EDA tool that guarantees conformance to the standard. We think 
such semantics-based tools could eventually be made efficient enough for indus- 
trial scale use, but one needs to choose applications where semantic accuracy is 
more critical than performance. The times (minutes) quoted in Section 3.3 will 
not impress members of the model checking community, but this doesn’t neces- 
sarily mean they are unacceptable, given the correct-by-construction benefits of 
the implementation method. 

³ The values of the HOL terms representing states are an artifact of the DFA subset 
construction, and should be treated as arbitrary terms. 

module Checker (StoB_REQ, BtoS_ACK, BtoR_REQ, RtoB_ACK); 

input StoB_REQ, BtoS_ACK, BtoR_REQ, RtoB_ACK; 
reg [1:0] state; 

initial state = 0; 

always @ (StoB_REQ or BtoS_ACK or BtoR_REQ or RtoB_ACK) 
begin 
  $display ("Checker: state = %0d", state); 
  case (state) 
    0: if (StoB_REQ) state = 1; else if (BtoS_ACK) state = 2; else state = 1; 
    1: if (StoB_REQ) state = 1; else if (BtoS_ACK) state = 2; else state = 1; 
    2: if (StoB_REQ) state = 3; else if (BtoS_ACK) state = 2; else state = 1; 
    3: begin $display ("Checker: property violated!"); $finish; end 
    default: begin $display ("Checker: unknown state"); $finish; end 
  endcase 
end 

endmodule 



Fig. 1. The Verilog state machine for the example property. 

The work described here illustrates a convergence of computation and deduc- 
tion, in which the execution of theorem proving strategies becomes a powerful 
method of implementation. We plan to extend, package and ruggedise our pro- 
totypes into standalone tools that automatically invoke HOL (currently they are 
invoked from HOL via ML functions). The interpreter is complete except for 
aborts, but the checker only handles a subset of formulas. Our goal is to cover 
the whole of PSL. 
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Abstract. BDDs allow succinct symbolic representation of digital circuits. Sym- 
metry reduction factors out redundancy inherent in the regular organization of 
many systems. Both are successful techniques for combating state space explo- 
sion. It would be desirable to combine them into symbolic symmetry reduction. 
Unfortunately, the straightforward approach to symmetry reduction requires the 
orbit relation, whose symbolic representation as a BDD is in general of exponential 
size. We investigate the use of generic representatives as a means of overcoming 
this problem for fully symmetric systems: instead of first representing the system 
as a BDD and then applying symmetry reduction, we translate the given pro- 
gram text into a symmetry-reduced version. The result can then be encoded using 
a BDD. We demonstrate that this method is superior not only to the traditional 
orbit-relation based symmetry reduction, but also to the approach using multiple 
representatives. 



1 Introduction 

Symbolic representation of systems, most notably in the form of binary decision diagrams 
(BDDs), is often more compact than explicit, enumerative representation. Symmetry re- 
duction is a powerful technique to limit the state space explosion problem. In symmetric 
systems, two states are considered equivalent if they are identical up to certain per- 
mutations of the participating processes. This relation gives rise to equivalence classes 
of states, called orbits. The Kripke structure built over the orbits can be shown to be 
bisimulation-equivalent to the structure built over individual states. 

The combination of symbolic representation with symmetry reduction was investigated 
in [CEFJ96]. The paper describes how the BDD for the representative function can 
be constructed, which maps a state to its unique orbit representative. Symbolic model 
checking in the presence of symmetry is then implemented by applying the representative 
function to the intermediate results during fixpoint evaluations. 

Computing the representative function requires the orbit relation, which contains 
pairs of states that are permutations of each other. The orbit relation turned out to be 
the bottleneck of symbolic symmetry reduction, since its BDD is, for many underlying 
symmetry groups, of size exponential in the minimum of the number of components and 

* This work was supported in part by NSF grants CCR-009-8141 and CCR-020-5483, and SRC 
contract 2002-TJ-1026. 



D. Geist and E. Tronci (Eds.): CHARME 2003, LNCS 2860, pp. 216-230, 2003. 
(c) Springer-Verlag Berlin Heidelberg 2003 
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the number of states per component. A partial remedy is to permit multiple representa- 
tives per orbit, which might be computable without using the orbit relation. However, 
choosing too many representatives per orbit defeats the purpose of symmetry reduction. 

To overcome the limitations of the above approach [CEFJ96] to combining symmetry 
reduction and symbolic representation using BDDs, we investigate, for fully symmetric 
systems, the use of generic representatives [ET99] as a means of avoiding the problems 
associated with picking representative states. Instead of first building a BDD for the 
system and then implementing symmetry reduction via the orbit relation, the symmetry 
is factored out at the source code level by compiling the original program into one 
that operates on counter variables. To keep track of equivalence classes of states, it is 
sufficient to store the number of processes in a given location, rather than their identities. 
For example, in a system with process locations $N$, $T$ and $C$, the states $(N, N, T, C)$, 
$(N, C, T, N)$, and $(T, N, N, C)$ are all symmetry-equivalent and can be represented 
generically as $(2N, 1T, 1C)$. 
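The counting idea can be sketched in a few lines of Python. This is only an illustration under our own naming (the function `generic_representative` is hypothetical and not part of the tooling described in this paper): it maps a tuple of per-process locations to the number of processes in each location.

```python
from collections import Counter

def generic_representative(state):
    """Return the location counts of a state as a sorted tuple of (location, count)."""
    return tuple(sorted(Counter(state).items()))

# The three states from the text collapse to a single generic representative.
states = [("N", "N", "T", "C"), ("N", "C", "T", "N"), ("T", "N", "N", "C")]
representatives = {generic_representative(s) for s in states}
```

Since only the counts are kept, any rearrangement of the processes yields the same tuple.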

In this paper we show how this idea can be applied to practical systems, where 
processes communicate via shared variables. In many applications, a global variable is 
used to point to one distinguished process, like one that possesses a token, or one that 
is currently allowed to enter its critical section. Since generic representatives get rid of 
process identities, such a variable must be adapted to a generic program. We show how 
this can be done by replacing it with a new variable that keeps track of the location of the 
distinguished process, rather than its identity. This method presents a slight challenge, 
though: if the distinguished process executes a transition, then its identity remains the 
same, but its location changes. This change must be reflected in the new variable. The 
complexity of a transition might therefore grow when translated into its generic form, 
although only by a small constant amount. 

We place suitable restrictions on the use of those global variables containing process 
indices in guards and actions in a program to ensure full symmetry. We also show 
the details of the program translation. We then define Kripke structures derived from 
the original and translated programs, respectively, and establish their bisimilarity. The 
generic method preserves all of the symmetry reduction, is applicable to a large class 
of fully symmetric systems and is efficient; in particular, it completely avoids the orbit 
relation. We demonstrate its usefulness in symbolic symmetry reduction by presenting 
experimental results for two systems with unique, multiple and generic representatives. 

The remainder of this paper is organized as follows. In section 2, we review traditional 
symmetry reduction with BDDs. In section 3 we illustrate, by means of an example, the 
notion of generic representatives. Section 4 formally describes how to translate a program 
given as a synchronization skeleton into an “equivalent” generic program. The translation 
of this new program into BDDs is the topic of section 5. We compare our method against 
other symbolic symmetry reduction techniques experimentally in section 6. Related and 
future work are discussed in the concluding section 7. 



2 Preliminaries 

Notation. For $a, b \in \mathbb{N}$, we denote by $[a..b]$ the set $\{i \in \mathbb{N} : a \leq i \leq b\}$. For a 
permutation $\pi$, the symbol $\pi^{-1}$ stands for its inverse. In programs, we use an imperative 
language style syntax. Block structure is indicated by indentation (instead of begin/end); 
comments go from “//” to the end of the line. 

2.1 Permutations Acting upon States 

The ideas presented in this paper apply to process symmetries, which describe the phe- 
nomenon that in a system of replicated process components, processes can be rearranged 
in certain ways simultaneously in the source and target state of all transitions of the sys- 
tem, without changing the overall transition relation. This can be formalized as follows. 
The systems under consideration are similar to shared variable programs [ES96]. We 
assume there are n concurrently executing processes, following an interleaved model of 
computation, which share some global variables. The possible local states of a process 
are given by a set of process locations $\mathcal{L}$. A system state can therefore be written as 
$s = (v, l_1, \ldots, l_n)$, where $v$ is a (possibly tuple-valued) global variable and $l_i \in \mathcal{L}$ 
is the location of process $i$. For $L \in \mathcal{L}$, we use $L_i$ as a shorthand for the expression 
$l_i = L$. The rearrangement of processes in a state is formalized in terms of a permutation 
$\pi : [1..n] \to [1..n]$ acting upon process indices. The mapping $\pi$ can be extended to 
act on system states by defining $\pi(s) = (v^\pi, l_{\pi(1)}, \ldots, l_{\pi(n)})$, where $v^\pi$ describes the 
result of $\pi$ acting on $v$. The definition of $v^\pi$ depends on the character of $v$: some global 
variables, like a binary semaphore, are invariant under permutations, such that $v^\pi = v$. 
On the other hand, a global token variable pointing to (the id of) some process is directly 
affected by $\pi$, as we shall see in section 3. In this case, one defines $v^\pi = \pi(v)$. 
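As a small illustration of a permutation acting on a state, the following Python sketch covers only the case of an id-independent global variable, where the global component is left unchanged. The function name `apply_permutation` and the 0-based tuple encoding of the permutation are our own assumptions.

```python
def apply_permutation(pi, state):
    """Apply pi to a state (v, (l_1, ..., l_n)) whose global variable v is id-independent.

    pi is a tuple with pi[i] the image of index i (0-based). Position i of the
    result holds the location of process pi[i], and v is left unchanged.
    """
    v, locations = state
    return (v, tuple(locations[pi[i]] for i in range(len(locations))))

state = (0, ("N", "T", "C"))              # semaphore value 0, three processes
rotated = apply_permutation((1, 2, 0), state)
```

The identity permutation `(0, 1, 2)` leaves every state unchanged, which is a quick sanity check on the index convention.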



2.2 Symmetry Reduction in Theory 

Given a set of permutations $G$ acting on $[1..n]$, a Kripke structure $M = (S, R, s_0)$, and 
a definition of $\pi(s)$ for $s \in S$, we say that $M$ is symmetric with respect to $G$ if, for all 
$\pi \in G$, $\pi(R) := \{(\pi(s), \pi(t)) : (s, t) \in R\}$ satisfies $\pi(R) \subseteq R$. In this case, it can be 
shown that in fact $\pi(R) = R$, and that $G$ is a group with function composition as the 
group operation. If $G$ contains all permutations over $[1..n]$, $M$ is called fully symmetric. 
This paper focuses on fully symmetric systems. $G$, the full symmetry group, is therefore 
henceforth omitted. 

The orbit relation $\theta(s, t) := \exists \pi : \pi(s) = t$ defines an equivalence between states; 
the equivalence classes it entails are called orbits. Instead of considering all states in $S$, 
it suffices now to choose a small set $Rep$ of representatives. This choice is reflected by 
the representative relation $\xi \subseteq S \times Rep$, which assigns to every state in $S$ elements of 
$Rep$ such that: 

soundness: for all $(s, r) \in \xi$ there exists $\pi$ such that $\pi(s) = r$ (i.e. $\xi \subseteq \theta$), and 
totality: for all $s \in S$, there exists $r \in Rep$ such that $(s, r) \in \xi$. 

The symmetry-reduced transition relation is obtained by replacing source and target of 
edges in $R$ by representatives: 

$\bar{R} = \{(\bar{s}, \bar{t}) \in Rep \times Rep : \exists s, t \in S : (s, \bar{s}) \in \xi \wedge (t, \bar{t}) \in \xi \wedge (s, t) \in R\}$. (1) 

The structure $\bar{M} := (Rep, \bar{R}, \bar{s}_0)$, for any $\bar{s}_0$ with $(s_0, \bar{s}_0) \in \xi$, is called the quotient 
model of $M$. For suitable choices of $Rep$ and $\xi$ it can be shown that $\bar{M}$ is 
bisimulation-equivalent to $M$, and therefore 

$M, s \models f \iff \bar{M}, \bar{s} \models f$ (2) 

for any $\bar{s}$ such that $(s, \bar{s}) \in \xi$ and every symmetric formula $f$: for all $\pi$, every $s \in S$ and 
every maximal propositional subformula $p$ appearing in $f$, $M, s \models p \iff M, \pi(s) \models p$. 
The “suitable choices” for $Rep$ and $\xi$ turn out to be crucial for efficiency. 
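For very small systems the orbits can simply be enumerated. The sketch below is a brute-force illustration under simplifying assumptions of our own: a state consists of process locations only (no global variable), and the names `orbit` and `orbits` are hypothetical.

```python
from itertools import permutations, product

def orbit(locations):
    """All states obtainable from `locations` by permuting process indices."""
    return frozenset(permutations(locations))

def orbits(location_set, n):
    """Partition the n-process states over `location_set` into orbits."""
    return {orbit(s) for s in product(location_set, repeat=n)}

all_orbits = orbits("NTC", 3)
```

For three processes and three locations, the 27 states fall into 10 orbits. Explicit enumeration of this kind does not scale, which is the reason for seeking symbolic alternatives.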

2.3 Unique Representatives 

It seems natural to pick exactly one representative from each orbit, such that the relation 
$\xi$ becomes a function. For instance, given a system state as an $n$-tuple over the process 
locations $\mathcal{L}$, $\xi$ could return the tuple with the locations sorted according to some 
ordering within $\mathcal{L}$ [LN91]. This mapping is sound, since sorting amounts to applying a 
permutation. It is also total, since every system state can be sorted in this way. Finally, 
the structure $\bar{M}$ derived from this choice of $Rep$ is indeed bisimulation-equivalent to $M$. 
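The sorting-based choice of representatives, together with the reduced relation of (1), can be illustrated on explicit states. This is a sketch only: the functions `representative` and `quotient` and the two-process example relation are our own assumptions, and an explicit set of transitions stands in for a BDD.

```python
def representative(state):
    """Unique representative: the locations sorted by a fixed ordering."""
    return tuple(sorted(state))

def quotient(transitions):
    """Symmetry-reduced relation as in (1), with the representative function as xi."""
    return {(representative(s), representative(t)) for s, t in transitions}

# Two processes; each may move from N to T.
R = {
    (("N", "N"), ("T", "N")),
    (("N", "N"), ("N", "T")),
    (("T", "N"), ("T", "T")),
    (("N", "T"), ("T", "T")),
}
R_bar = quotient(R)
```

The four transitions of the original relation collapse to two in the quotient.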

The only currently known way to construct a BDD for $\xi$ with unique representatives 
is by first building the BDD for the orbit relation $\theta$ and then projecting the second 
component of $\theta$ onto $Rep$: $\xi = \{(s, r) \in \theta : r \in Rep\}$. Unfortunately, this approach is 
generally problematic in terms of both time and space [CEFJ96]: The orbit problem (are 
two states related by $\theta$?) is at least as hard as the graph isomorphism problem, for which 
no polynomial-time algorithm is known. Making it worse for symbolic representations, 
the BDD of the orbit relation is, for many common symmetry groups, of size at least 
$\min\{2^n, \ldots\}$. 

2.4 Multiple Representatives 

A computationally less expensive choice of $Rep$ and $\xi$ is possible if the uniqueness 
requirement for the representatives is dropped. This approach imposes a few weaker 
constraints on $Rep$ and $\xi$, which we sketch here briefly; for details see [CEFJ96]. 

Definition 1 ([CEFJ96]) Let $Rep$ be a set of representatives and $\xi$ a sound and total 
representative relation. A set $C$ of permutations is complete if: 

- for all $(s, r) \in \xi$ there exists $\pi \in C$ such that $\pi(s) = r$, and 

- for all $\pi \in C$ and $r \in Rep$, $(\pi(r), r) \in \xi$. 

Notice that if the representatives are unique as in 2.3, the full symmetry group $G$ is a 
complete set. We hope, however, to find a small complete subset $C$. Intuitively, we can 
then restrict our attention to permutations from $C$ in the search for representatives of a 
given state. 

Theorem 2 ([CEFJ96]) Let $Rep$ and $\xi$ be as in definition 1. If there exists a complete 
set $C$, then $M, s \models f \iff \bar{M}, \bar{s} \models f$ with $\bar{M}$, $\bar{s}$ and $f$ as in (2). 
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Fig. 1. The synchronization skeleton for a token solution to the Mutual Exclusion problem 



In practice, it is the programmer’s responsibility to first define a set $Rep$ representable 
by a small BDD. In [CEFJ96], it is described how a suitable set $C$ can be derived. By 
finally defining $\xi$ as 

$(s, r) \in \xi$ iff $r \in Rep \wedge \exists \pi \in C : \pi(s) = r$, (3) 

$C$ is a complete set for $Rep$ and $\xi$. If the expression $\exists \pi \in C : \pi(s) = r$ and $R$ can 
also be encoded succinctly, then the BDD for $\bar{R}$ as in (1) is small; the orbit relation is 
nowhere used. By theorem 2, we can now perform model checking on $\bar{M}$. 

The symmetry reduction effect is negatively impacted by choosing several represen- 
tatives per orbit. While this could still be advantageous when using BDDs, it is not clear 
that $Rep$ can always be chosen to allow a small BDD for $\xi$ and $\bar{R}$. In the remainder of 
this paper, we argue that in the case of full symmetry, a solution exists that avoids all 
these problems altogether. 

3 Generic Representatives - A Case Study 

Full symmetries occur frequently in practice, whenever a system is composed of un- 
ordered, pairwise interchangeable components. This is the case for clique networks of 
processes, but also for bus and star topologies, where components communicate via a 
centralized hub (such as in cache coherence protocols). In the latter cases, the bus or hub 
can be “factored out”, while full symmetry reduction can be applied to the remaining 
processes. 

A fully symmetric system is concisely specified by the number n of processes, 
possible global variables with initial values, and the common program executed by all 
processes. As an abstraction of this program, we assume, for the purpose of describ- 
ing the formal translation into BDDs, the input model of synchronization skeletons. 
These skeletons are appropriate and powerful enough to describe most control-intensive 
synchronization problems over finite domains. Combinations of values for the local vari- 
ables of a process are abstracted into a local state; assignments to those variables are 
represented as local state changes. Sequential code executed by a process in an atomic 
action is abstracted into a single transition. 

As an example, consider a token-based solution to the n-process Mutual Exclusion 
problem with a global variable $tok \in [1..n]$, and the skeleton in figure 1. A skeleton’s 
arcs can be labeled with guards (shown in the diagram above the arc) and actions (shown 
below it, executed after the transition). The skeleton in the figure allows a process to 
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Variables: 

    $n_N, n_T, n_C$ : $[0..n]$ 
    $TOK$ : $\{N, T, C\}$ 

// from transition 
// $N \to T$: 
if $n_N > 0$ 
    if $TOK = N$ 
        if $n_N = 1$ 
            $TOK := T$ 
        else 
            $TOK := \{N, T\}$ 
    $n_N := n_N - 1$ 
    $n_T := n_T + 1$ 

Fig. 2. Generic version of the token-based Mutual Exclusion solution 

enter its critical section $C$ if it currently possesses the token ($tok = self$). Upon leaving 
$C$, it sets $tok$ to a nondeterministic value in $[1..n]$. The skeleton gives rise to a fully 

symmetric structure, as we will see in the next section for skeletons written in a specific 
input syntax. 

We now want to construct a new program based on counters that yields a bisimula- 
tion-equivalent structure. Instead of a local state variable for each process, we somewhat 
conversely declare global counter variables for each local state, calling them $n_N$, $n_T$, 
$n_C$. A slight challenge is provided by the $tok$ variable with range $[1..n]$. Since the counter 
variables deliberately ignore process identities, we cannot check a guard like tok = self 
any more. However, assume there are several processes in location T. Since they are 
indistinguishable, it does not matter which of them has the token (if any). Rather, it 
suffices to remember, in a new variable TOK, the location of the process possessing the 
token. Thus, TOK ranges over {N,T,C}. 

The translated program consists of the variables and statements shown in figure 2. 
The values of the counter variables range from 0 to the number of processes, n. The 
initial values of all four variables follow from the fact that all processes start out in 
location N . All transitions in the new program require that the counter of the source 
state is positive, since the transition can be taken only if there is a process in that state. 

The first transition, $N \to T$, has apparently nothing to do with the token, since 
$tok$ does not explicitly appear in it. However, the process executing it might be the one 
possessing the token, in which case the new variable $TOK$ must be updated from $N$ 
to $T$. If $TOK = N$ and $n_N = 1$, then the executing process has the token, and we 
set $TOK$ to $T$. If $TOK = N$ and $n_N > 1$, then the process executing the transition 
may or may not be the one possessing the token, so we must set $TOK$ to $T$, or $TOK$ 
must remain at $N$, respectively. Hence, the new program has two transitions in this case, 
which we abbreviate by a nondeterministic assignment $TOK := \{N, T\}$. Finally, the 
actual location change is reflected by decreasing $n_N$ and increasing $n_T$. A similar, but 
simpler reasoning motivates the translation of the other two transitions; in particular, 
the condition $n_L > 0$ in the assignment to $TOK$ in the last statement ensures that only 
locations in which there is at least one process are nondeterministically chosen. 



Initial values: 
    $(n_N, n_T, n_C) := (n, 0, 0)$ 
    $TOK := N$ 

// from transition 
// $T \to C$, guard $tok = self$: 
if $n_T > 0 \wedge TOK = T$ 
    $TOK := C$ 
    $n_T := n_T - 1$ 
    $n_C := n_C + 1$ 

// from transition 
// $C \to N$, action $tok := ndet[1..n]$: 
if $n_C > 0$ 
    $n_C := n_C - 1$ 
    $n_N := n_N + 1$ 
    $TOK := ndet\{L : n_L > 0\}$ 
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The property to be verified also needs to be translated into counters. As an example, 
compare the mutual exclusion (safety) and communal progress (liveness) requirements 
in specific and generic notation: 

            specific                                                    generic 

Safety:     $AG\,\forall i, j : i \neq j : \neg(C_i \wedge C_j)$        $AG(n_C < 2)$ 

Liveness:   $AG(\exists i\, T_i \Rightarrow AF\, \exists j\, C_j)$,     $AG(n_T > 0 \Rightarrow AF\, n_C > 0)$. 

The liveness property states that if there is some process in its trying region, then 
in any possible future, there should eventually be some process entering its criti- 
cal section. This property is weaker than progress of an individual process, formally 
$AG\,\forall i : (T_i \Rightarrow AF\, C_i)$. The latter formula, however, is not symmetric, since the maximal 
propositional subformula $T_i$ is not invariant under permutations. It can therefore 
not directly be verified over a symmetry reduced structure (whether specific or generic). 
One approach to overcoming this problem is to “factor out” one of the processes and 
treat its local variables as global. The progress property is formulated for this process, 
and symmetry reduction is applied to the remaining ones. This approach is described 
in more detail by Pnueli, Xu, and Zuck [PXZ02], incidentally for counter-abstracted 
programs. 
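To make the generic safety property concrete, the following Python sketch explores the reachable states of the counter program of figure 2 explicitly and checks $AG(n_C < 2)$ on them. It is an illustration only: the explicit search stands in for BDD-based model checking, and the function names are our own.

```python
def successors(state):
    """Successors of a generic state (TOK, n_N, n_T, n_C) under the program of Fig. 2."""
    tok, n_N, n_T, n_C = state
    result = set()
    if n_N > 0:                                   # from transition N -> T
        if tok == "N":
            new_toks = {"T"} if n_N == 1 else {"N", "T"}
        else:
            new_toks = {tok}
        result |= {(t, n_N - 1, n_T + 1, n_C) for t in new_toks}
    if n_T > 0 and tok == "T":                    # from transition T -> C, guard tok = self
        result.add(("C", n_N, n_T - 1, n_C + 1))
    if n_C > 0:                                   # from transition C -> N, tok := ndet[1..n]
        counts = {"N": n_N + 1, "T": n_T, "C": n_C - 1}
        result |= {(L, n_N + 1, n_T, n_C - 1) for L, c in counts.items() if c > 0}
    return result

def reachable(n):
    """All generic states reachable from the initial state (N, n, 0, 0)."""
    initial = ("N", n, 0, 0)
    seen, frontier = {initial}, [initial]
    while frontier:
        for t in successors(frontier.pop()):
            if t not in seen:
                seen.add(t)
                frontier.append(t)
    return seen

def mutual_exclusion_holds(n):
    """AG(n_C < 2), checked on every reachable generic state."""
    return all(n_C < 2 for _, _, _, n_C in reachable(n))
```

Calling `mutual_exclusion_holds(n)` for a few small values of `n` exercises all three translated statements.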

To see that implementing the above translation is tantamount to performing symmetry 
reduction on the program text, notice that all states from one equivalence class of the 
original system are mapped by the translation to the same counter tuple $(TOK, n_N, n_T, 
n_C)$. This tuple can therefore be viewed as an “unusual notation” for the representative of 
the orbit — we call it a generic representative. The new program can now be transformed 
into a Kripke structure, represented by BDDs, and model checked. 

4 Translating Symmetric Programs into Generic Form 

The global variable tok in the previous section contains a process index, which is lost 
after the introduction of counters. Such variables require special treatment during the 
translation process. We call them id-sensitive. Global variables independent of process 
identities, for example a boolean semaphore, are, as we shall see, much simpler to handle. 
We refer to them as id-independent variables. 

We assume a program $P$ in the form of the following parameters: (1) the number $n$ 
of processes, (2) any number of id-independent global variables, given as a single vector 
$v$ with range $V$ (cross product of individual ranges), along with initial value $x_0$, (3) any 
number $z$ of id-sensitive global variables, given as $d = (d_1, \ldots, d_z)$ with range $[1..n]^z$, 
along with initial value $k_0$, and (4) a synchronization skeleton. The latter is a finite 
directed graph, each node of which represents (and is identified with) a process location; 
call their number $l$. One of the nodes, $l_0$, is the distinguished initial location of every 
process. The edges may be labeled with a guard and an action (which default to true 
and no-op, respectively). 

Syntax of Guards. Guards are arbitrary propositional combinations of boolean-valued 
basic guards, the latter being conditions on process locations and expressions over global 
variables. In order to ensure full symmetry of the structure entailed by the program, basic 
guards must meet certain criteria. 
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Table 1. Fully symmetric basic guards on process locations 



no.   Basic guard                                      Generic version   Meaning 

0     $\forall i : \neg L_i$                           $n_L = 0$         none 
1     $\forall i : L_i$                                $n_L = n$         all 
2     $\exists i, j : i \neq j : L_i \wedge L_j$       $n_L \geq 2$      at least two 


Definition 3 For a quantified boolean formula $h$ over atoms of the form $L_i$, $i \in [1..n]$, 
and a permutation $\pi$ on $[1..n]$, define $\pi(h)$ by $\pi$ acting upon the indices. Formula $h$ is 
fully symmetric if $h \Leftrightarrow \pi(h)$ is valid. 

Some basic guards satisfying this definition are listed in table 1. As an example, the guard, 
exactly one process is in location $L$, formally $(\exists i : L_i) \wedge (\forall i, j : L_i \wedge L_j \Rightarrow i = j)$, 
is equivalent to $\neg 0 \wedge \neg 2$, where 0 and 2 are two basic guards from the table. It is more 
succinctly written as $n_L = 1$ in generic terms. 
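The correspondence between the quantified guards and their counter versions can be checked exhaustively for small systems. The following Python sketch is such a check under our own naming (`guards_agree` is hypothetical); it evaluates each quantified guard directly on every state and compares it with the condition on the count of processes in location `L`.

```python
from itertools import product

def guards_agree(n, locations, L):
    """Compare the basic guards of table 1 with their generic versions on every state."""
    pairs = [(i, j) for i in range(n) for j in range(n)]
    for state in product(locations, repeat=n):
        n_L = state.count(L)
        in_L = [loc == L for loc in state]
        none = all(not x for x in in_L)                       # guard 0
        every = all(in_L)                                     # guard 1
        two = any(i != j and in_L[i] and in_L[j] for i, j in pairs)   # guard 2
        exactly_one = any(in_L) and all(i == j for i, j in pairs if in_L[i] and in_L[j])
        specific = (none, every, two, exactly_one)
        generic = (n_L == 0, n_L == n, n_L >= 2, n_L == 1)
        if specific != generic:
            return False
    return True
```

For example, `guards_agree(3, "NTC", "T")` walks over all 27 states of a three-process system.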

Any (syntactically valid) expression over id-independent global variables is “by 
nature” fully symmetric and thus a legal basic guard. As for an id-sensitive variable d, 
we allow the expressions $d = self$ and $d \neq self$ as basic guards. 

Syntax of Actions. An action consists of at most one assignment to each of the global 
variables. The execution model for the assignments — e.g. parallel or sequential — is left 
to the implementation, since it is irrelevant for the translation of the source program into 
generic representatives. 

As with guards, to ensure full symmetry the syntax of actions must be restricted. 
Any (syntactically valid) assignment to the id-independent variables is legal, since it 
does not affect the symmetry of the program. For an id-sensitive variable d we allow the 
following three types: 

$d := self$        $d := ndet[1..n]$        $d := ndet([1..n] \setminus \{self\})$. 

The last two actions intuitively assign a nondeterministic value in the given set to d. 
Their precise semantics is given by the derivation of a Kripke structure: 

Definition 4 A program specified in the above syntax defines a Kripke structure 
$M = (S, R, s_0)$ with $S = V \times [1..n]^z \times [1..l]^n$, $s_0 = (x_0, k_0, l_0, \ldots, l_0)$, and $R$ containing 
all pairs $(s, t)$ with 

$s = (x, k, l_1, \ldots, l_{i-1}, A, l_{i+1}, \ldots, l_n)$,  $t = (x', k', l_1, \ldots, l_{i-1}, B, l_{i+1}, \ldots, l_n)$ 

such that there is an edge $e : A \to B$ in the skeleton with a guard that evaluates to true 
for $v = x$, $d = k$, $self = i$ and process locations as in $s$, and $e$’s action $\mathcal{A}$ satisfies 
the Hoare triple $\{v = x\}\ \mathcal{A}\ \{v = x'\}$, and for each id-sensitive variable $d$ with values $k$ 
and $k'$ in $s$ and $t$, resp., $\mathcal{A}$ has an assignment $d := self$ and $k' = i$, or an assignment 
$d := ndet\,Z$ for some $Z$ with $k' \in Z$, or $\mathcal{A}$ has no assignment to $d$ and $k' = k$. 

The following theorem shows that symmetry reduction can be applied to M. 

Theorem 5 For $s = (x, k_1, \ldots, k_z, l_1, \ldots, l_n)$, let 
$\pi(s) = (x, \pi^{-1}(k_1), \ldots, \pi^{-1}(k_z), l_{\pi(1)}, \ldots, l_{\pi(n)})$ 
and $\pi(R)$ as in section 2.2. $M$ is fully symmetric, that is, $\pi(R) \subseteq R$ 
for all permutations $\pi$. 

We are now ready to describe the translation of program $P$ from its components (1) 
through (4) (beginning of this section): The new program $\hat{P}$ consists of the same variable 
$v$ with initial value $x_0$, further variables $\hat{d}_j$, $j \in [1..z]$, with range $[1..l]$ and common 
initial value $l_0$, and variables $n_1, \ldots, n_l$ with range $[0..n]$ and initial values $n_{l_0} = n$, 
$n_L = 0$ for $L \neq l_0$. Every edge of the skeleton is translated into a statement as follows: 

if $n_A > 0 \wedge gen(guard)$ 
    update1(guard) 
    $n_A := n_A - 1$                    (4) 
    $n_B := n_B + 1$ 
    update2(action) 

The condition $n_A > 0$ ensures that there is a process in location $A$. The guard is translated 
by a function gen as follows: each basic guard on process locations is replaced according 
to table 1. For an id-sensitive variable $d_j$, guard $d_j = self$ is replaced by $\hat{d}_j = A$, guard 
$d_j \neq self$ by $\hat{d}_j \neq A \vee n_A \geq 2$ (if $n_A \geq 2$, there is a process $i$ in location $A$ with 
$d_j \neq i$, hence $d_j \neq self$ is true for that process). Expressions over $v$ are unchanged. 

Function update1 performs updates of variable $\hat{d}_j$ that become necessary because 
of the location change. It is only required if action does not assign to $d_j$ (otherwise, $\hat{d}_j$ 
is overwritten by update2(action)). 




guard                          update1(guard) 

$d_j = self$                   $\hat{d}_j := B$ 

$d_j \neq self$                no-op 

otherwise (including true)     if $\hat{d}_j = A$ 
                                   if $n_A = 1$ 
                                       $\hat{d}_j := B$ 
                                   else 
                                       $\hat{d}_j := ndet\{A, B\}$ 



Function update2 implements updates of $v$ and $\hat{d}_j$ that are due to the action. It 
leaves no-op and assignments to $v$ unchanged. For the assignments to $d_j$, we translate 
as follows: 

action                                       update2(action) 

$d_j := self$                                $\hat{d}_j := B$ 

$d_j := ndet([1..n] \setminus \{self\})$     if $n_B = 1$ 
                                                 $\hat{d}_j := ndet(\{L : n_L > 0\} \setminus \{B\})$ 
                                             else 
                                                 $\hat{d}_j := ndet\{L : n_L > 0\}$ 

$d_j := ndet[1..n]$                          $\hat{d}_j := ndet\{L : n_L > 0\}$ 
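The translation of a single edge can be written down as an executable sketch. The Python function below follows the shape of statement (4) for one id-sensitive variable and no id-independent variables. The string encodings of guards and actions (`"self"`, `"notself"`, `"true"`, `"ndet_all"`, `"ndet_others"`) and the function name are our own assumptions, chosen only for illustration.

```python
def generic_successors(state, edge, locations):
    """Successors of a generic state under one translated edge, following (4).

    state = (d_hat, counts), counts being a dict location -> number of processes
    edge  = (A, B, guard, action), guard in {"self", "notself", "true"},
            action in {None, "self", "ndet_all", "ndet_others"}
    """
    d_hat, counts = state
    A, B, guard, action = edge
    if counts[A] == 0:                                  # n_A > 0 fails
        return []
    # gen(guard)
    if guard == "self" and d_hat != A:
        return []
    if guard == "notself" and d_hat == A and counts[A] < 2:
        return []
    # update1(guard): follow the distinguished process if it may be the one moving
    if guard == "self":
        choices = {B}
    elif guard == "notself" or d_hat != A:
        choices = {d_hat}
    else:
        choices = {B} if counts[A] == 1 else {A, B}
    new_counts = dict(counts)
    new_counts[A] -= 1
    new_counts[B] += 1
    # update2(action): overrides update1 whenever the action assigns to d
    occupied = {L for L in locations if new_counts[L] > 0}
    if action == "self":
        choices = {B}
    elif action == "ndet_all":
        choices = occupied
    elif action == "ndet_others":
        choices = occupied - {B} if new_counts[B] == 1 else occupied
    return [(d, new_counts) for d in sorted(choices)]
```

Applied to the edge `("N", "T", "true", None)` with two processes in `N` and the token in `N`, the function yields the two outcomes discussed for the first transition of the token example.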



Definition 6 Program $\hat{P}$ defines a Kripke structure $\hat{M} = (\hat{S}, \hat{R}, \hat{s}_0)$ with 
$\hat{S} = V \times [1..l]^z \times [0..n]^l$, $\hat{s}_0 = (x_0, l_0, \ldots, l_0, n_1, \ldots, n_l)$ such that $n_{l_0} = n$, $n_L = 0$ for all 
$L \neq l_0$, and $\hat{R}$ containing all pairs $(s, t)$ such that there is a (nondeterministic) statement 
in $\hat{P}$ whose top-level condition $n_A > 0 \wedge gen(guard)$ evaluates to true and that contains 
an execution that, applied to $s$, results in $t$. 

Theorem 7 Structures $M$ (definition 4) and $\hat{M}$ are bisimulation-equivalent via 

$b : S \to \hat{S}$,  $b(x, k_1, \ldots, k_z, l_1, \ldots, l_n) = (x, l_{k_1}, \ldots, l_{k_z}, n_1, \ldots, n_l)$ 

with $n_L := |\{j \in [1..n] : l_j = L\}|$. Function $b$ maps every state to its unique generic 
representative. The following theorem shows that although generic representatives are 
not based on permutations, they define the same equivalence classes as the orbit relation: 



Theorem 8 For any $r, s \in S$, $b(r) = b(s)$ if and only if $\exists \pi : \pi(r) = s$. 
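The statement of theorem 8 can be checked by brute force on small instances. The sketch below assumes one id-sensitive variable and no id-independent variables, uses 0-based process indices, and takes the mapping to send a state to the location of the distinguished process together with the location counts; the function names are hypothetical.

```python
from collections import Counter
from itertools import permutations, product

def b(state):
    """Generic representative of (k, locations): location of process k plus location counts."""
    k, locs = state
    return (locs[k], tuple(sorted(Counter(locs).items())))

def act(pi, state):
    """pi acting on a state: the id-sensitive value goes to pi^{-1}(k), position i gets l_{pi(i)}."""
    k, locs = state
    inverse = [0] * len(pi)
    for i, p in enumerate(pi):
        inverse[p] = i
    return (inverse[k], tuple(locs[pi[i]] for i in range(len(locs))))

def theorem8_holds(n, locations):
    """Check b(r) = b(s) iff some permutation maps r to s, for all pairs of states."""
    states = [(k, locs) for k in range(n) for locs in product(locations, repeat=n)]
    perms = list(permutations(range(n)))
    for r in states:
        orbit = {act(pi, r) for pi in perms}
        if any((b(r) == b(s)) != (s in orbit) for s in states):
            return False
    return True
```

The check is exhaustive and therefore only practical for very small `n`.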

In order to model check over structure $\hat{M}$, the specification must be rewritten in 
generic notation. We assume it is a CTL formula whose atomic propositions are fully 
symmetric expressions on local state variables (translated like the examples in table 1) 
and expressions on the id-independent global variable (unchanged). Such a formula is 
symmetric in the sense defined right below (2). 

Note that the translation of the program as well as of the formula can be done fully 
automatically, in time linear in the size of the program text. 



5 Translating Generic Programs into BDDs 

In this section, we show how the statements of the generic program, obtained in section 4, 
can be encoded in a BDD efficiently. We will also estimate the sizes of those BDDs, 
depending on $n$, $l$ and the size of the input synchronization skeleton. In this section we 
ignore the existence of the id-independent variable v: since expressions involving it are 
subject to no restrictions, BDD sizes cannot be estimated. However, those expressions 
are not altered during the translation; hence they do not contribute any change in BDD 
size. 

The generic structure $\hat{M} = (\hat{S}, \hat{R}, \hat{s}_0)$ is the disjunction of statements of the form in 
(4) in section 4. BDDs implementing those statements can be obtained as follows: 

$n_A > 0$ iff there is at least one true bit among the $\lceil \log(n + 1) \rceil$ bits representing $n_A$. 
    This can be implemented as a disjunction over all those bits. The resulting BDD 
    size is linear in the number of participating bits: $O(\log n)$. 
gen(guard) is a propositional combination of basic generic guards. Guards from table 1 
    can be realized as above with a BDD that compares the constant bit-wise against 
    the counter variable; size $O(\log n)$. Basic generic guards involving the id-sensitive 
    variable have the form $\hat{d} = A$ or $\hat{d} \neq A \vee n_A \geq 2$, which can again be verified 
    bit-wise; these BDDs thus have maximum size $O(\log l \log n)$ ($\hat{d} \in [1..l]$, $n_A \in [0..n]$). 
    Let $F$ denote the number of basic guards appearing in guard. The total BDD size for 
    this part of a transition is then no more than $O((\log l \log n)^F)$. Since $F$ is typically 
    a small constant, this bound is usually polynomial in practice. 
update1(guard): an if-then-else statement can be implemented using the common 
    ITE operation for BDDs. Since the expressions contained inside the if-then-else 
    are again comparisons against constants, the entire statement can be encoded in a 
    BDD of size $O(\log^\alpha l \cdot \log^\beta n)$, for small constants $\alpha$ and $\beta$. 
$n_L := n_L \pm 1$: since the right-hand side is not a constant, a bit-wise comparison is 
    not possible. The increment can be implemented by searching (using existential 
    quantification) for a bit position $i$ at which $n_L$ is 0, $n_L'$ (the next-state value) is 1, 
    for all preceding bits $n_L$ and $n_L'$ are identical, and for all succeeding bits $n_L$ is 1 
    and $n_L'$ is 0. The worst-case BDD size over two variables of $\lceil \log(n + 1) \rceil$ input bits 
    is $2^{2\lceil \log(n+1) \rceil} = O(n^2)$. 

update2(action): assignment $\hat{d} := ndet\{L : n_L > 0\}$ can be realized with a BDD for 
    the expression $\bigvee_L (n_L > 0 \wedge \hat{d}' = L)$ of size $O((\log n \log l)^l)$. The BDD for 
    the if-then-else statement then has size $O(\log^{\ldots} n \cdot (\log n \log l)^{\ldots})$. 
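The bit-level conditions for the counter test and the increment can be checked against ordinary arithmetic. The Python sketch below works on explicit bit lists rather than BDDs, so it illustrates only the logic of the encoding, not its size; the function names and the most-significant-bit-first ordering are our own assumptions.

```python
def to_bits(x, width):
    """Bits of x, most significant first."""
    return [(x >> (width - 1 - i)) & 1 for i in range(width)]

def is_positive(bits):
    """Counter > 0 as a disjunction over the bits of the counter."""
    return any(bits)

def is_increment(cur, nxt):
    """True iff nxt = cur + 1, found by searching for the bit position that flips from 0 to 1."""
    for i in range(len(cur)):
        if (cur[i] == 0 and nxt[i] == 1
                and cur[:i] == nxt[:i]                   # preceding bits identical
                and all(bit == 1 for bit in cur[i + 1:])     # succeeding bits of cur are 1
                and all(bit == 0 for bit in nxt[i + 1:])):   # succeeding bits of nxt are 0
            return True
    return False

def relation_matches_arithmetic(n):
    """Compare is_increment with y == x + 1 on all values of ceil(log2(n + 1)) bits."""
    width = n.bit_length()                               # equals ceil(log2(n + 1))
    values = range(2 ** width)
    return all(is_increment(to_bits(x, width), to_bits(y, width)) == (y == x + 1)
               for x in values for y in values)
```

The decrement is symmetric: swapping the two arguments of `is_increment` tests whether the next-state value is one less than the current one.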

Assuming (very defensibly) that $F$ is a small constant, we can see that all parts of the 
translation of an edge can be expressed with a BDD that is low-degree polynomial in 
$n$, although, with respect to $l$, it can be of order $(\log l)^l$ (caused by the 
$\hat{d} := ndet\{L : n_L > 0\}$ statement). The complexity of the overall transition relation depends on the 
way the individual statements are combined, but it is guaranteed to be polynomial in $n$ 
as well. 

It is interesting to investigate how the relative sizes of $n$ and $l$ influence the benefit 
of generic representatives. Because of the $n \log l$ input variables of the BDDs for the 
specific representatives algorithm ($n$ variables of range $[1..l]$ for $\theta$, $\xi$ and $R$), hence a 
maximum specific BDD size of roughly $l^n$, it can be assumed that the generic method 
is most useful if $n$ is larger than $l$. Asymptotically, this is the case if $l$ is a constant and 
$n$ is considered variable. This situation occurs frequently in practice, since, for a given 
application, the number $l$ of local states is often fixed. Our second experimental case, 
presented in the next section, is such an instance. 
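The trade-off between $n$ and $l$ can be made tangible by counting bits and states for both encodings. The following Python sketch ignores global variables; the function name is hypothetical, and the count of generic location tuples uses the standard formula for multisets, which is our addition rather than a figure from this paper.

```python
from math import comb

def encoding_sizes(n, l):
    """Bit and state counts for the specific and the generic encoding of n processes, l locations."""
    bits_per_location = (l - 1).bit_length()       # ceil(log2(l)) for l >= 1
    bits_per_counter = n.bit_length()              # ceil(log2(n + 1))
    return {
        "specific_bits": n * bits_per_location,    # one location variable per process
        "generic_bits": l * bits_per_counter,      # one counter per location
        "specific_states": l ** n,
        "generic_states": comb(n + l - 1, l - 1),  # count tuples summing to n
    }

sizes = encoding_sizes(10, 3)
```

With $l$ fixed, the number of generic bits grows only logarithmically in $n$, whereas the specific encoding grows linearly.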

6 Experimental Results 

We compare traditional to generic symmetry reduction using two examples: 

The first is an artificial Mutual Exclusion scenario that allows us to show how the 
generic method scales for varying values for $n$ and $l$. Each process can be in one of 
the local states $L^1, \ldots, L^l$, where $L^{l-1}$ and $L^l$ take the roles of the trying region and 
critical section, respectively. The process must go through $L^1$ to $L^{l-1}$ in this order before 
proceeding into $L^l$. In addition, the transition into $L^l$ is protected by a binary semaphore, 
which is released again upon the process’ return to $L^1$: 



Transition                                      Guard    Action
$L^i \to L^{i+1}$ for $1 \le i \le l - 2$       true     no-op
$L^{l-1} \to L^l$                               !sem     sem := 1
$L^l \to L^1$                                   true     sem := 0
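
The transition table can be turned into a small counter-based (generic) model and explored explicitly. The sketch below is illustrative only: it uses plain Python sets rather than BDDs, and it assumes that all $n$ processes start in $L^1$ with the semaphore free, which the text does not state. The names `generic_successors` and `reachable` are hypothetical.

```python
from collections import deque

def generic_successors(state, l):
    """state = (counters, sem); counters[k] is the number of processes in L^(k+1)."""
    counters, sem = state
    succs = []
    # L^i -> L^(i+1) for 1 <= i <= l-2: guard true, no-op.
    for k in range(l - 2):
        if counters[k] > 0:
            c = list(counters)
            c[k] -= 1
            c[k + 1] += 1
            succs.append((tuple(c), sem))
    # L^(l-1) -> L^l: guard !sem, action sem := 1.
    if counters[l - 2] > 0 and sem == 0:
        c = list(counters)
        c[l - 2] -= 1
        c[l - 1] += 1
        succs.append((tuple(c), 1))
    # L^l -> L^1: guard true, action sem := 0.
    if counters[l - 1] > 0:
        c = list(counters)
        c[l - 1] -= 1
        c[0] += 1
        succs.append((tuple(c), 0))
    return succs

def reachable(n, l):
    """Forward reachability from the assumed initial state (all in L^1, sem free)."""
    init = ((n,) + (0,) * (l - 1), 0)
    seen = {init}
    queue = deque([init])
    while queue:
        s = queue.popleft()
        for t in generic_successors(s, l):
            if t not in seen:
                seen.add(t)
                queue.append(t)
    return seen
```

On this model, mutual exclusion amounts to checking that the last counter stays below 2 in every reachable state, for example `all(c[-1] < 2 for c, _ in reachable(5, 4))`.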



As a second example, we chose a variant of the MCS list-based queuing lock with
atomic compare_and_swap instruction [MCS91, also used in ID96]. The algorithm
consists of an acquire and a release operation for a lock with the property that a process
waiting for the lock spins only on process-local variables, instead of spinning on a shared
variable (like a semaphore). According to [MCS91], spins on shared variables can cause
memory contention and severe system performance degradation.

For the second example, the input was not a synchronization skeleton, but the program
text for the two operations. In order to perform counter abstraction on this symmetric 
system, the number of local states needs to be determined. The acquire operation forces 
processes to line up for the lock in a queue. Each process remembers its successor, which 
can be any of the n — 1 other processes, such that the number of local states of a process 
is not constant. While forming a queue is a valuable property for enforcing a special type 
of liveness on the processes, it is less relevant for the verification of safety properties. 
We therefore generalized the system so as to allow any process that is “ready” to obtain 
the lock to do so. Since the safety property — no two processes can acquire the lock at 
the same time — turned out to be true for this conservative abstraction with a constant 
number of 28 local states, we conclude that it holds in a system that enforces FIFO order. 

For both problems, we experimented with unique, multiple and generic representa- 
tives. For multiple ones, we chose the set Rep as follows: 

$r \in Rep \iff \exists i : 1 \le i \le l :$ process 1 is in location $L^i$ $\wedge$
locations $L^j$ with $j < i$ do not appear in $r$.

For example, using I = 3, the states and 

belong to the set Rep, but are not unique representatives, in which the 
superscripts have to be in order. It turns out that the BDD for the representative relation 
^ derived from Rep can be computed much more efficiently than that for the function ^ 
for unique representatives. Looking back at definition 1, the complete set $C$ to be
chosen contains the $n$ permutations that swap index 1 with index $i$, for $1 \le i \le n$. It
can be shown that $C$ indeed satisfies the two properties required in definition 1. $C$ is
exponentially smaller than the full symmetry group.
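
The membership test for Rep and the role of the $n$ swaps can be illustrated on explicit states. This is a sketch under assumptions: a state is modeled as a tuple of location indices, one per process, with process 1 at position 0; shared variables are omitted; and the names `in_rep`, `swap_with_first`, and `representatives` are hypothetical.

```python
from itertools import product

def in_rep(state):
    """Process 1 sits in L^i and no location L^j with j < i occurs in the state."""
    return state[0] == min(state)

def swap_with_first(state, i):
    """Apply the permutation that swaps process 1 with process i + 1."""
    s = list(state)
    s[0], s[i] = s[i], s[0]
    return tuple(s)

def representatives(state):
    """Representatives of `state` reachable by one of the n swaps (identity included)."""
    candidates = {swap_with_first(state, i) for i in range(len(state))}
    return {s for s in candidates if in_rep(s)}

if __name__ == "__main__":
    # (1, 3, 2) and (1, 2, 3) are symmetric to each other and both lie in Rep,
    # so representatives are not unique.
    print(in_rep((1, 3, 2)), in_rep((1, 2, 3)), in_rep((2, 1, 3)))
    # Every state over l = 3 locations and n = 3 processes has a representative.
    print(all(representatives(s) for s in product(range(1, 4), repeat=3)))
```

Swapping process 1 with any process that occupies the smallest location always lands in Rep, which is why $n$ permutations suffice in this sketch.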

For the first example, we verified the standard safety property:
$AG\, \forall i, j : i \neq j : \neg(L_i^l \wedge L_j^l)$ (generically: $AG\, n_{L^l} < 2$). For the second example, we
verified that no two processes can acquire the lock at the same time, and also that there
is no deadlock in the system. The latter means that it is never the case that all processes
are simultaneously spinning in one of the two busy-waits that are present in the operations.
Such a situation would cause a deadlock since a process cannot free itself from a busy-wait,
but can only be unlocked by another process.

These properties were verified using the CUDD BDD package [SOI] for the standard 
symbolic fixpoint characterization of EF bad. Table 2 shows how the space requirements 
and running times of the three methods of symmetry reduction compare. 

Discussion. First, for multiple and generic representatives, it can be seen that there 
is still room to grow memory-wise, but not necessarily so for unique representatives. 
Indeed, the main motivation for research on alternatives to unique representatives was 
the impractical BDD size of the orbit relation. 

Further, the unique representatives approach spends nearly all of its time on the orbit 
relation construction. The use of multiple representatives clearly reduces memory and 
time requirements. The generic representatives solution outperforms, by several orders 
of magnitude, the other two both in terms of memory and time, and hence in the size 
of problems it can handle. According to the table, although multiple representatives do 
remedy the major disadvantage of an orbit relation based solution somewhat, generic 
representatives have in turn an equally impressive benefit over multiple ones. 
Table 2. Space and run time comparisons (i686/1400 MHz PC, 256 MB memory)

                      Unique Specific                Multiple Specific       (Unique) Generic
Choice of n, l        Representatives                Representatives         Representatives
                      no. of live   time in sec.     no. of live   time      no. of live   time
Example     n    l    BDD nodes     (% orbit rel.)   BDD nodes     in sec.   BDD nodes     in sec.

MUTEX       8    4      114,894     8.2 (97%)            2,211        0.0          703        0.0
            6    5    2,152,710     137.3 (97%)          6,612        0.1          690        0.0
           16   16            ?     >15h (100%)        132,377        6.6        4,876        0.0
           64   16            -     -                  599,561      198.8        6,972        0.1
          128  128            -     -                        ?       >15h       69,060       10.4
          256  128            -     -                        -          -       78,060       12.6

MCS-LK      3   28      113,188     2.4 (79%)           30,614        0.2        1,340        0.0
            4   28    9,478,195     4386.7 (95%)        75,604        0.5        2,608        0.0
            8   28            ?     >15h (100%)        272,080       15.4        7,320        0.3
           16   28            -     -                2,417,477     5055.3       24,094        2.7
           20   28            -     -                        ?       >15h       34,170        5.0
           60   28            -     -                        -          -      293,981      266.8



7 Conclusion 

In this paper, we investigated the use of generic representatives in symbolic model 
checking of fully symmetric systems. Compared to unique representatives, with generic 
ones there is no need to construct the orbit relation. Compared to multiple representatives, 
the generic ones maintain full symmetry reduction. The BDD derived from the generic 
structure M turned out to be small for the examples we experimented with. For the class 
of programs presented here, the translation into generic representatives can be done 
automatically and in negligible time. 

Generic representatives seem to prove useful outside the symbolic domain as well:
we translated some of the fully symmetric example programs coming with the Murφ
explicit state verifier [DDHY92] into generic representatives. For some examples, we
obtained savings in terms of both time and space of several orders of magnitude over
Murφ's symmetry reduction algorithms (using unique or multiple representatives).



Related and Future Work. Barner and Grumberg [BG02] considered combining
symmetry and symbolic representation using BDDs mainly for falsification. They perform
reachability analysis by discarding states symmetric to previously seen states. How- 
ever, due to orbit complexity problems, the algorithm uses multiple representatives and 
therefore forgoes some of the symmetry reduction possible. Also, according to [BG02], 
computation costs often incur the use of under-approximations of the set of reached 
representatives, which renders the algorithm inexact. 

Finite counters have been used previously to abstractly represent states of systems 
with many processes. Pnueli, Xu and Zuck [PXZ02] used truncated counters with values 
0, 1, or 2 to approximate the number of processes in certain locations in reasoning about 
symmetric parameterized systems. Emerson and Trefler [ET99] used counters in the 
form of generic representatives in connection with fully symmetric programs. Other 
examples can be found in the work by Emerson and Srinivasan [ES90] on synthesis of
parameterized programs and in the work by Pong and Dubois [PD95] on cache protocol 
verification. 

Several years ago, Ip and Dill [ID96] introduced scalar sets in the description of
the input program to enforce full symmetry. The Murφ verifier is an explicit-state
implementation of this approach. Since Murφ was originally not designed to exclusively
target symmetric systems, Murφ's input language is more general. In addition to
non-symmetric programs, it allows one to write programs exhibiting symmetry other than
process symmetry, which is discussed in this paper. To make our approach more readily
applicable, we would like to allow a more convenient input language than
synchronization skeletons, perhaps similar to that of Murφ.

The present formulation of generic representatives is directly applicable only to (the
common case of) fully symmetric systems. We would like to do research on systems
whose symmetry group is the product of full symmetry groups of subsystems [CEJS98, 
section 5.1], and systems that are almost, but not fully, symmetric [ET99]. The ultimate 
goal is to apply the generic method to some larger, perhaps industrial-size examples. 

Abstract. Intrusion-tolerance is the technique of using fault-tolerance
to achieve security properties. Assuming that faults, both benign and 
Byzantine, are unavoidable, the main goal of Intrusion-tolerance is to 
preserve an acceptable, though possibly degraded, service of the over- 
all system despite intrusions at some of its sub-parts. In this paper, we 
present a correctness proof of the Intrusion-tolerant Enclaves protocol [1] 
via an adaptive combination of techniques, namely model checking, the- 
orem proving and analytical mathematics. We use Murphi to verify au- 
thentication, then PVS to formally specify and prove proper Byzantine 
Agreement, Agreement Termination and Integrity, and finally we math- 
ematically prove robustness of the group key management module. 



1 Introduction 

Substantial progress in the formal verification of cryptographic protocols has
been achieved during the last decade. A wide variety of techniques has been
developed to verify a number of key security properties ranging from confidentiality
and authentication to atomic transactions and non-repudiation [2,3]. Nevertheless,
the focus has been either on two-party protocols (i.e., involving only a pair
of users) or, in the best cases, on group protocols with centralized leadership
(i.e., a presumably trusted fault-free server managing a group of users). In the
present work, we are concerned with the verification of the intrusion-tolerant
Enclaves [1]: a group-membership protocol with a distributed leadership
architecture, where the authority of the traditional single server is shared among a set
of n independent elementary servers, of which at most f could fail at the same
time. The protocol has a maximum resilience of one third (i.e., $f \le \lfloor (n-1)/3 \rfloor$) and
uses an algorithm similar to the consistent broadcast of Bracha and Toueg [4].

The primary goal of Enclaves is to preserve an acceptable group-membership 
service of the overall system despite intrusions at some of its sub-parts. For 
instance, an authorized user u who requests to join an active group of users 
should be eventually accepted, despite the fact that faulty leaders may coordi- 
nate their messages in such a way as to mislead non-faulty leaders (the majority) 
into disagreement, and thus into rejecting user u. Moreover, in order to prevent
malicious leaders from leaking sensitive information (e.g., group keys) or
providing clients with fake group keys, Enclaves uses a verifiably secure secret sharing
scheme.

To achieve its intrusion-tolerant capabilities, Enclaves relies on the combination
of a cryptographic authentication protocol, a Byzantine fault-tolerant leader
agreement protocol and a secret sharing scheme. Although we assume the un- 
derlying cryptographic primitives and fault-tolerant components to be perfect, 
one cannot easily guarantee security of the whole protocol. In fact, several pro- 
tocols had been long thought to be secure until a simple attack was found (see 
[20] for a survey). Therefore, the question of whether or not a protocol actually 
achieves its security goals becomes paramount. To date, most of the research in 
protocol analysis has been devoted to finding attacks on known, either two-party 
or centralized protocols. In this paper we are concerned with the verification of 
a distributed multi-leader group communication protocol. 

An important issue that arises in formal verification of Byzantine fault-tolerant
protocols is the modeling of Byzantine behavior. How much power
should be given to a Byzantine fault and how general should the model be to 
capture the arbitrary nature of a Byzantine fault behavior? These questions 
have been extensively studied [7,9,10] and continue to be a center of focus. In 
this paper, faults are only limited by cryptographic constraints. For instance, 
faulty leaders can arbitrarily send random messages, reset their local clocks and 
perform any action without satisfying its precondition. They cannot, however, 
decrypt a message without having the appropriate key, or impersonate other 
participants by forging cryptographic signatures. More details about our fault 
assumptions are discussed in Section 2. 

In this work, we discuss a formal analysis of the overall Byzantine fault- 
tolerant Enclaves protocol. We experiment with an adaptive combination of 
techniques, chosen according to the nature of the correctness arguments in 
each module, the environment assumptions, and the ease of performing
verification. For instance, we found it more profitable to model-check the
authentication module by taking advantage of the reduction techniques available
in Murphi [15]. The Byzantine leaders agreement module, however, was a little 
trickier. In fact, the latter relies, to a large extent, on the timing and the 
coordination of a set of distributed actions, possibly performed by Byzantine 
faulty processes whose behavior is hard to represent in a model-checker. Instead, 
we use PVS [21] and formalize the protocol in the style of Timed-Automata 
[5]. This formalism makes it easy to express timing constraints on transitions. 
It also captures several useful aspects of real-time systems such as liveness, 
periodicity and bounded timing delays. Using this formalism, we specified the 
protocol for any number of leaders, and we proved safety and liveness properties 
such as Proper Agreement, Agreement Termination and Integrity. Finally, the 
group-key management module is based on a secret sharing scheme whose 
security relies fundamentally on the hardness of computing discrete logarithms 
in groups of large prime order. Due to the hardness of expressing the latter 
correctness arguments in a formal language, we found it more convenient to
give a manual proof of the module’s robustness and unpredictability properties, 
using the Random Oracle model [19]. 

The remainder of this paper is organized as follows. In Section 2, we give an 
overview of the architecture and design goals of Enclaves, and we explicitly state 
our system model assumptions. In Section 3, we describe the model checking of 
the authentication module in Murphi. In Section 4, we present how we model 
the elementary components of the Byzantine leader agreement module in PVS 
and how we build the final protocol model out of these ingredients. In Section 5, 
we formulate and prove our correctness theorems. In Section 6, we briefly give 
the mathematical proof of robustness and unpredictability of the group key 
management module. In Section 7, we discuss some related work. Finally in 
Section 8, we conclude the paper by commenting on our results and stating 
some perspectives for future work. 



2 The Enclaves Protocol 

Enclaves [1] is a protocol that enables users to share information and collabo- 
rate securely through insecure networks such as the Internet. Enclaves provides 
services for building and managing groups of users. Access to a given group is 
granted only to sets of users who have the right credentials to do so. Authorized 
users can dynamically, and at their will, join, leave, and rejoin, an active group. 
The group communication service relies on a secure multicasting channel that 
ensures integrity and confidentiality of group communication. All messages sent 
by a group member are encrypted and delivered to all other group members. 

The group-management service consists of user authentication, access control,
and group-key distribution. Figure 1 shows the different phases of the protocol
execution. Initially at time $t_1$, user u sends requests to join the group to a
set of leaders. These leaders locally authenticate u within time interval $[t_1, t_2]$.
When done, the agreement procedure starts and terminates at time $t_3$ by reaching
a consensus as to whether or not to accept user u. Finally on acceptance, user
u is provided with the current group composition, as well as information to
reconstruct the group-key. Once in the group, each member is notified when a new
user joins or a member leaves the group in such a way that all members are in
possession of a consistent image of the current group-key holders.

In summary, Enclaves should guarantee the following properties, even in the
presence of up to f corrupted leaders:

— Proper authentication and access control: Only authorized users can join the 
group and an authorized user cannot be prevented from joining the group. 

— Confidentiality of group communication: Messages from a member u can be 
read only by the users who were in u’s image of the group at the time the 
message was sent. 

The description of Enclaves in [1] assumes a reliable network where messages
eventually reach their destinations within an upper bound delivery time. In this
paper we make the same assumptions. Concerning the intruder, we adopt a
standard model where an intruder fully monitors the network, proactively augments
its knowledge, and chooses to send, either adaptively or randomly, messages on
the network. The intruder, however, cannot block messages from reaching their
destination and is limited by cryptographic constraints. For instance, the
intruder cannot decrypt messages without having the right key, or impersonate
other participants by forging cryptographic signatures. For the leaders
agreement module, in particular, we assume the cryptography layer to be perfect
(i.e., the message format is well chosen to prevent any leakage of sensitive
information), and we concentrate rather on the Byzantine fault-tolerance capabilities of
the protocol.

Given the above assumptions, we prove that the Proper authentication and
access control requirement holds through (1) the model checking of the Proper
Authentication invariant in Murphi (cf. Section 3), and (2) the proofs of Proper
Agreement, Agreement Termination and Agreement Integrity theorems in PVS
(cf. Sections 4 and 5). In addition, we prove the Confidentiality of group
communication requirement via a mathematical analysis of the Robustness and
Unpredictability properties of the group key management module of Enclaves (cf.
Section 6).

3 Model Checking Authentication in Murphi

Murphi has a language that supports scalable models. In a scalable model one 
typically starts with a small protocol configuration and gradually increases the 
protocol size. In many cases, errors in the general protocol (possibly infinite
state) will also show up in down-scaled (finite state) versions of the protocol.
The Murphi tool is based on explicit state enumeration and supports a number 
of reduction techniques such as symmetry and data independency [16,17]. The 
desired properties of a protocol can be specified in Murphi by invariants. If a 
state is reached where some invariant is violated, Murphi prints an error trace 
exhibiting the problem. 
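
The general idea of invariant checking by explicit state enumeration can be sketched as a breadth-first search that records parent pointers, so that a violating state can be turned into an error trace. This is only an illustration of the principle, not Murphi's implementation (which adds reductions such as symmetry); the function name and the toy model in the usage example are hypothetical.

```python
from collections import deque

def check_invariant(init_states, successors, invariant):
    """Return None if the invariant holds in all reachable states,
    otherwise a shortest trace (list of states) ending in a violation."""
    parent = {s: None for s in init_states}
    queue = deque(init_states)
    while queue:
        s = queue.popleft()
        if not invariant(s):
            trace = []
            while s is not None:          # walk parent pointers back to an initial state
                trace.append(s)
                s = parent[s]
            return trace[::-1]
        for t in successors(s):
            if t not in parent:           # state not seen before
                parent[t] = s
                queue.append(t)
    return None

if __name__ == "__main__":
    # Toy model: a counter modulo 5; the "bad" state is 3.
    print(check_invariant([0], lambda s: [(s + 1) % 5], lambda s: s != 3))
```

States must be hashable for the `parent` dictionary; a finite, down-scaled model keeps the search bounded.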

Our verification has been conducted as follows. First, we formulated the 
protocol by identifying the protocol participants, the state variables and messages,
and the key actions to be taken. Then we added an intruder to the system. In 
our model, the intruder is a participant in the protocol, capable of eavesdropping 
messages in transit, decrypting cipher-text when it has the appropriate keys, and 
generating new messages using any combination of previously gained knowledge. 
Finally, we stated the desired correctness conditions and ran the protocol for 
some specific size parameters. 

The main property we are concerned with in this paper is mutual
authentication between a given pair of leader and client. More precisely, at the
end of a protocol execution between a leader $L_i$ and a client C, $L_i$ should
be able to assert that it has indeed been talking to client C, and vice versa.
The verification has been done by means of invariant checking under the
above-mentioned assumptions. The client proper authentication invariant is
given below. It basically states that for each leader i, if it committed to a
session with a client, this client (whose identifier is stored in
lead[i].client) must have started the protocol with leader i, i.e., have stored
i in its field leader and be waiting for acknowledgment (i.e., in state
C_ACK).



invariant "client proper authentication"
  forall i: LeaderId do
    lead[i].state = L_COMMIT &
    ismember(lead[i].client, ClientId)
    ->
    clnt[lead[i].client].leader = i &
    clnt[lead[i].client].state = C_ACK
  end;
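
Read as a predicate over a single global state, the invariant can be mirrored in a few lines of Python. The sketch below is an assumption-laden paraphrase: leaders and clients are modeled as dictionaries of records whose field names follow the Murphi fragment above, and `client_ids` stands in for the `ClientId` type.

```python
def client_proper_authentication(lead, clnt, client_ids):
    """For every committed leader i whose partner is a client, that client must
    have started the protocol with i and be waiting for the acknowledgment."""
    for i, leader in lead.items():
        committed = leader["state"] == "L_COMMIT" and leader["client"] in client_ids
        if committed:
            partner = clnt[leader["client"]]
            if not (partner["leader"] == i and partner["state"] == "C_ACK"):
                return False
    return True

if __name__ == "__main__":
    lead = {1: {"state": "L_COMMIT", "client": "c1"}}
    clnt = {"c1": {"leader": 1, "state": "C_ACK"}}
    print(client_proper_authentication(lead, clnt, {"c1"}))
```

A leader that has committed to an identity outside `client_ids` (for instance the intruder) is not constrained by this particular invariant, matching the `ismember` guard.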

In addition to the above invariant, we have checked a similar one for leaders
proper authentication (i.e., the clients are sure about the identity of the
leaders they are communicating with). Table 1 shows the number of reached states
and CPU run times taken on a 440 MHz Sparc machine with 256 MB of memory
for different sizes of the protocol. The instances we consider have been
chosen to emphasize the weight of each size parameter. For example, the
intruder is modeled to be very powerful (intercepts, replays, and generates
messages), so adding a second intruder does not increase the intrusion power, it
just multiplies the complexity. Also, the last row in Table 1 shows an
inconclusive result, where Murphi runs out of memory before reaching all possible
states.

Table 1. Model checking experimental results

          Number of                Network
Clients   Leaders   Intruders     size       States    CPU time
2         4         1             1            4591    13.25 s
2         4         1             3          125793    331.00 s
1         4         2             3          277176    1481.35 s
4         10        1             3          797000    -

4 Modeling Byzantine Agreement in PVS 

Most group communication protocols, including Enclaves, can be modeled by 
an automaton whose initial state is modified by the participants’ actions as 
the group mutates (new members join). Because Enclaves depends also on time 
(participants timeout, timestamp group views, etc.), it was convenient to model 
it as a timed automaton. In the current verification, timing is used only to
ensure that actions progress. Timing, however, is essential to prove upper bounds on
agreement delays (e.g., a maximum join delay), but this is beyond the scope of
this paper. Participants in a typical run of Enclaves consist of a set of n leaders
(f of which are faulty), a group of members, and one or more users requesting to
join the group.

In the remainder of this section, we first explain our general PVS theory 
about timed automata. The parameters of this theory are used here to formalize 
Enclaves by defining the actions, the states, and the precondition and effect of 
each action. Finally, the resulting executions of the protocol and fault assump- 
tions are described. 

4.1 Timed Automata 

We present a general, protocol-independent theory called Timed Automata.
Given a number of parameters, it defines all possible executions of the
protocol as a set of Runs. A run is a sequence of the form
$s_0 \xrightarrow{a_0} s_1 \xrightarrow{a_1} s_2 \xrightarrow{a_2} s_3 \xrightarrow{a_3} \cdots$
where the $s_i$ are states, representing a snapshot of the system during execution,
and the $a_i$ are the executed actions. A particular protocol (an instance of the
timed automaton) is characterized by sets of possible States and Actions, a
condition Init on the initial state, the precondition Pre of each action, expressing
in which states that action can be executed, the effect Effect of each action,
expressing the possible state changes by the action, and a function now which
gives the current time in each state. In a typical application, there is a special
delay action which models the passage of time and increases the value of now.
All other actions do not change time.¹

¹ For more details about the PVS theories and proofs, we refer the reader to the web
page: http://hvg.ece.concordia.ca/Research/CRYPTO/Enclaves.html
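
The parameters of the theory (Init, Pre, Effect, now) can be mirrored in a small Python sketch that checks whether a finite sequence of states and actions is a legal prefix of a run. This is an illustration only, not a rendering of the PVS theory; the class name, the function `is_run_prefix`, and the toy `clock` instance are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence

@dataclass(frozen=True)
class TimedAutomaton:
    init: Callable[[Any], bool]               # condition on the initial state
    pre: Callable[[Any, Any], bool]           # pre(action, state): is the action enabled?
    effect: Callable[[Any, Any, Any], bool]   # effect(action, s0, s1): allowed state change?
    now: Callable[[Any], float]               # current time of a state

def is_run_prefix(ta: TimedAutomaton, states: Sequence, actions: Sequence) -> bool:
    """Check that s0 -a0-> s1 -a1-> ... is a legal finite prefix of a run."""
    if len(states) != len(actions) + 1 or not ta.init(states[0]):
        return False
    for a, s0, s1 in zip(actions, states, states[1:]):
        if not (ta.pre(a, s0) and ta.effect(a, s0, s1)):
            return False
    return True

# Hypothetical instance: a state is a pair (now, ticks).
def _effect(a, s0, s1):
    if a[0] == "delay":
        return s1 == (s0[0] + a[1], s0[1])    # only delay advances time
    return s1 == (s0[0], s0[1] + 1)           # tick leaves time unchanged

clock = TimedAutomaton(
    init=lambda s: s == (0.0, 0),
    pre=lambda a, s: a[0] != "delay" or a[1] >= 0,
    effect=_effect,
    now=lambda s: s[0],
)
```

Modeling Effect as a relation between two states, rather than as a function, leaves room for nondeterministic actions such as Byzantine misbehavior.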

4.2 Leaders Actions

To define the actions of the leaders, we first state a few preliminary definitions. 
Let n be the number of leaders and let f be such that $3f + 1 \le n$ (the maximum
number of faulty leaders). For simplicity, leaders are identified by an element of
$\{0, 1, \ldots, n - 1\}$. Users are represented by some uninterpreted non-empty type,
and time is modeled by the set of non-negative real numbers. 
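
As a small worked example of the bound $3f + 1 \le n$, the largest admissible number of faulty leaders for a given n can be computed directly. The helper name `max_faulty` is hypothetical.

```python
def max_faulty(n: int) -> int:
    """Largest f such that 3*f + 1 <= n."""
    return (n - 1) // 3

if __name__ == "__main__":
    for n in (3, 4, 7, 10):
        print(n, max_faulty(n))
```

For instance, four leaders tolerate one faulty leader, while three leaders tolerate none.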

The actions of the protocol are represented in PVS as a data type, which 
ensures, e.g., that all actions are syntactically different. Thereafter, we define 
the following actions: 

— A general delay action which occurs in all our timed models; it increases the
current time (now), and all other clocks that may be defined in the system, 
with the amount specified by a delay parameter del. 

— An announce action is used to send announcement messages of new locally 
authenticated users to the other leaders of the protocol. 

— A trypropagate action allows a user announcement to be further spread
among leaders. This action is executed periodically, but it only changes the
state of the system if enough announcements (f + 1) have been received for
the considered user and it has not already been announced or propagated
by the leader in question before.

— A tryaccept action is used to let leaders periodically check whether they have
received enough announcements and/or propagation messages for a given
user. Once this condition is satisfied, the user is accepted to join the group.

— A receive action allows a leader to receive messages; it removes a received 
message from the network and adds corresponding data to the local buffer 
of the leader. 

— A crash action models the failure of a leader. After a crash, a leader may still 
perform all the actions mentioned above, but in addition it may perform a 
misbehave action. 

— An action misbehave models the Byzantine mode of failure and can only be 
performed by a faulty (crashed) leader. 

Besides, we define three time constants for the maximum delay of messages in the 
network, the maximum delay between trypropagate actions and the maximum 
delay between tryaccept actions. 

4.3 States 

In order to properly capture the distributed nature of the network, it is suitable 
to model two kinds of states: a local state for each leader, accessible only to the 
particular leader, and a global state to represent global system behavior which 
includes the local state of each leader, the representation of the network and a 
global notion of time. 

An important part of the local state is the group view, which is a set of users 
in the current group. In fact, the ultimate goal of Enclaves is to assure consistency 
of the group views. Moreover, we use a Boolean flag (faulty) marking the leader 
status as faulty or not, some local timers (clockp and clocka) to enforce upper
bounds on the occurrence of trypropagate and tryaccept actions, and finally a
list (received) of the leaders from which the local leader received proposals for
a given user.

Views : TYPE = setof[UserIds]

LeaderStates : TYPE =
  [# view     : Views,
     faulty   : bool,
     clockp   : Time,   % clock for the trypropagate action
     clocka   : Time,   % clock for the tryaccept action
     received : [UserIds -> list[LeaderIds]] #]

We model Messages as quadruples containing a source, a destination, a proposed 
user and a timestamp indicating an upper bound on the delivery time, i.e., the 
message must be received before the tmout value. 

Messages : TYPE = [# src      : LeaderIds,
                     tmout    : Time,
                     proposal : UserIds,
                     dest     : LeaderIds #]

In the global states, the network is modeled as a set of messages. Messages 
that are broadcast by leaders are added to this set, with a particular time-out 
value, and they are eventually received, possibly with different delays and in a
different order at the recipients' ends. The global state also contains the local state of
each leader and a global notion of time, represented by now. 

GlobalStates : TYPE = [# ls      : [LeaderIds -> LeaderStates],
                         now     : Time,
                         network : setof[Messages] #]

s, s0, s1 : VAR GlobalStates

Furthermore, we define a predicate Init that expresses conditions on the initial 
state, requiring that all views, received sets and the network are empty, and all 
clocks and now are set to zero. 

4.4 Precondition and Effect 

For each action A, we define its precondition, expressing when the action is 
enabled, and its effect. An announce action may always occur and hence has 
precondition true. Similarly for trypropagate and tryaccept, which should occur 
periodically. Action receive(i) is only allowed when there exists a message in
the network with destination i. For simplicity, a crash action is only allowed 
if the leader is not faulty (alternatively, we could take precondition true). A 
misbehave action may only occur for faulty leaders. 

Most interesting is the precondition of the delay(t) action. This action
increases now and all timers (clockp and clocka) by t. To ensure that messages are
delivered before their time-out value, we require that the condition prenetwork,
defined below, holds in the state before any delay(t) action is taken, which fits
our informal assumptions about network reliability.

prenetwork(s, t) : bool = FORALL msg :
    member(msg, network(s)) IMPLIES now(s) + t <= tmout(msg)
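
The prenetwork condition has a direct executable reading: a delay of t is permitted only if no pending message would miss its time-out. The sketch below assumes messages are dictionaries with a `tmout` field, mirroring the Messages record; the helper `max_allowed_delay` is a hypothetical addition that is not part of the PVS model.

```python
def prenetwork(now: float, network, t: float) -> bool:
    """True if delaying by t keeps every pending message within its time-out."""
    return all(now + t <= msg["tmout"] for msg in network)

def max_allowed_delay(now: float, network) -> float:
    """Largest delay that prenetwork admits (unbounded for an empty network)."""
    if not network:
        return float("inf")
    return min(msg["tmout"] for msg in network) - now

if __name__ == "__main__":
    pending = [{"tmout": 5.0}, {"tmout": 3.0}]
    print(prenetwork(1.0, pending, 2.0), prenetwork(1.0, pending, 2.5))
```

In effect, time can only advance past a message's time-out after a receive action has removed that message from the network.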

Similarly, there is a condition preclock which requires that all timers (clockp
and clocka) are not larger than MaxTryPropagate and MaxTryAccept,
respectively. Since the trypropagate and tryaccept actions reset their local timers
to zero, this may enforce the occurrence of such an action before a time delay is
possible.

Next we define the effect of each action, relating a state s0 immediately before
the action and a state s1 immediately afterwards.

- delay(t) increments now and all local timers by t, as defined by s0 + t.

- announce(i, u) adds, for each leader j, a message to the network, with source
  i, time-out now(s0) + MaxMessageDelay, proposal u, and destination j.

- trypropagate(i) resets clockp to zero and adds to the network messages,
  to all leaders, containing proposals for each user for which at least f + 1
  messages have been received.

- tryaccept(i) resets clocka to zero and adds to its local view all users for
  which at least (n - f) messages have been received.

- receive(i) removes a message with destination i from the network, say with
  source j and proposal u, and adds j to the list of received leaders for u,
  provided it is not in this list already.

- crash(i) sets the flag faulty of i to true.

- misbehave(i) may just reset the local timers clockp and clocka of i to zero,
  as expressed by ResetClock(s0, i, s1), or it may add randomly as well as
  maliciously chosen messages to the network (provided that timeouts are not
  violated). A misbehaving leader, however, cannot impersonate other protocol
  participants, i.e., any message sent on the network has the identifier of its
  actual sender.
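
The delay preconditions and two of the effects above can be mirrored in a small executable model. The following Python sketch is an illustration only and is not the PVS specification: the record layout, the constant values, and the reading of preclock as a bound on the timers after the delay are assumptions made here.

```python
from dataclasses import dataclass, field

# Illustrative values only; in the paper these are parameters of the model.
MAX_MESSAGE_DELAY = 5.0
MAX_TRY_PROPAGATE = 3.0
MAX_TRY_ACCEPT = 3.0


@dataclass(frozen=True)
class Msg:
    src: int
    dest: int
    proposal: str
    tmout: float  # absolute time by which the message must be delivered


@dataclass
class State:
    n: int                                       # number of leaders
    now: float = 0.0
    network: set = field(default_factory=set)
    clockp: dict = field(default_factory=dict)   # leader -> propagate timer
    clocka: dict = field(default_factory=dict)   # leader -> accept timer

    def __post_init__(self):
        for i in range(self.n):
            self.clockp.setdefault(i, 0.0)
            self.clocka.setdefault(i, 0.0)


def prenetwork(s, t):
    """No message in the network may pass its time-out during the delay."""
    return all(s.now + t <= m.tmout for m in s.network)


def preclock(s, t):
    """Assumed reading: timers must stay within their bounds after the delay."""
    return all(s.clockp[i] + t <= MAX_TRY_PROPAGATE and
               s.clocka[i] + t <= MAX_TRY_ACCEPT for i in range(s.n))


def delay(s, t):
    """Effect of delay(t): advance now and all local timers by t."""
    if not (prenetwork(s, t) and preclock(s, t)):
        raise ValueError("delay(t) is not enabled in this state")
    s.now += t
    for i in range(s.n):
        s.clockp[i] += t
        s.clocka[i] += t


def announce(s, i, u):
    """Effect of announce(i, u): one message per leader j, sent by i."""
    for j in range(s.n):
        s.network.add(Msg(i, j, u, s.now + MAX_MESSAGE_DELAY))
```

In this sketch a delay that would push a timer past its bound is rejected, which is how the model forces a trypropagate or tryaccept action (each of which resets a timer) to happen first.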



4.5 Protocol Runs and Fault Assumption 

Runs of this timed automata model of Enclaves are obtained by importing the
general timed automata theory. This leads to type Runs, with typical variable r.
Let Faulty(r, i) be a predicate expressing that leader i has a state in which it is
faulty. It is easy to check in PVS that once a leader becomes faulty, it remains
faulty forever. Let FaultyNumber(r) be the number of faulty leaders in run r
(it can be defined recursively in PVS). Then we postulate by an axiom that the
maximum number of faults is f (MaxFaults : AXIOM FaultyNumber(r) <= f).

5 Proving Byzantine Agreement in PVS

We are interested in verifying the following properties of the Enclaves protocol: 

— Termination: if user u wants to join an active group and has been an- 
nounced by enough non-faulty leaders, then eventually user u will be ac- 
cepted by all non-faulty leaders and become a member of the group. 

— Integrity: a user that has been accepted in the group should have been 
announced by a non-faulty leader earlier during the protocol execution. 

— Proper Agreement: if a non-faulty leader decides to accept user u, then 
all non-faulty leaders accept user u too. 

In the remainder of this section, we briefly outline proofs of the above theorems. 

Theorem 1 (Termination) 

For all r and u, announced_by_many(r, u) implies accepted_by_all(r, u)
where

- announced_by_many(r, u) expresses that at least (f + 1) non-faulty leaders
  announced user u during run r;

- accepted_by_all(r, u) asserts that eventually all non-faulty leaders have
  user u in their view during run r.

Proof. Assume announced_by_many(r, u), which implies that at least (f + 1)
non-faulty leaders broadcast a proposal for u. Because of the reliability of the
network, eventually these messages will be delivered to their destination, and
in particular to the (n - f) non-faulty leaders of the network. They all receive
(f + 1) announcement messages for user u, which is enough to trigger the
propagation procedure (for u) for all non-faulty leaders who did not participate in the
announcement phase. Now because of the network reliability, we conclude that
eventually all non-faulty leaders will receive at least (n - f) approvals for user
u, enough to make a majority, since (n - f) > f follows from n > 3f. □

Theorem 2 (Integrity) 

For all r and u, accepted_by_one(r, u) implies announced_by_one(r, u)
where

- accepted_by_one(r, u) holds if at least one leader eventually included u in
  its view during run r;

- announced_by_one(r, u) expresses that at least one non-faulty leader
  announced user u during run r.

Proof. We proceed by contrapositive and use the non-impersonation property.
We assume that no non-faulty leader has made an announcement for user u
during run r. Now because of non-impersonation, faulty leaders cannot
send more than f different announcements. This implies that the leaders would
receive no more than f announcements for user u, which is not enough to trigger
propagation actions. This yields that u will never be proposed by any of the
non-faulty leaders, and hence none of them will receive as many as (n - f) messages
for u (recall (n - f) > f). As a result, user u will never be accepted by any of
the non-faulty leaders. □

Theorem 3 (Proper Agreement)

For all r and u, accepted_by_one(r, u) implies accepted_by_all(r, u)

Proof. accepted_by_one(r, u) implies that there exists a non-faulty leader
that received at least (n - f) approvals (i.e., announcements or propagation
messages) for user u. Among these approvals, at least (n - 2f) come from
non-faulty leaders (by non-impersonation). Now because these leaders are non-faulty,
they broadcast the same approval to all the other leaders. In addition, because
of the network reliability, these messages are eventually delivered to their destination.
This implies that all (n - f) non-faulty leaders eventually receive the above
(n - 2f) approvals. Since (n - 2f) >= (f + 1), all (n - f) non-faulty leaders have
received at least (f + 1) messages for u. As in the proof of Termination, the
latter implies the start of the propagation procedure, then the reception of at
least (n - f) approvals for user u, and finally the acceptance of u by all non-faulty
leaders. □
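
The three proofs rest on two counting facts that follow from n > 3f: (n - f) > f and (n - 2f) >= (f + 1). As a sanity check outside PVS, the short Python sketch below tests them by brute force over a range of values; the function names and the range are chosen here for illustration and are not part of the PVS development.

```python
def thresholds_hold(n, f):
    """Check the two counting facts used in Theorems 1-3 for one pair (n, f)."""
    majority = (n - f) > f             # used in Termination and Integrity
    overlap = (n - 2 * f) >= (f + 1)   # used in Proper Agreement
    return majority and overlap


def check_all(max_f=50, extra=20):
    """Test every f up to max_f and every n with 3f < n <= 3f + extra."""
    return all(thresholds_hold(n, f)
               for f in range(max_f + 1)
               for n in range(3 * f + 1, 3 * f + extra + 1))
```

With n = 3f the second inequality already fails (for example n = 3, f = 1), which is why the bound n > 3f cannot be weakened in the proof of Proper Agreement.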

The above proofs were conducted successfully in PVS and required over 40 
lemmas. Integrity and Termination were the most challenging to prove and they 
helped deduce Proper Agreement. 

6 Group Key Management: Mathematical Proof 

In the previous sections we discussed authentication and leader agreement. We
also saw that once the leaders agree on accepting a client C, they proceed with
providing it with a group key. We direct our focus here to the Enclaves group
key management module [1]. This module is based on a secret sharing scheme
which ensures that (1) the f dishonest leaders cannot obtain the group key even
if they all conspire together (at least (f + 1) shares are needed to reconstruct the
secret); (2) the group key is renewed every time the group changes (new join
or leave); and (3) the clients are able to discern valid key shares from fake ones
(possibly issued by malicious leaders).

The group key management protocol of Enclaves is based on previous work
of Cachin et al. [19]. The security property of the protocol relies on the hardness
of computing discrete logarithms in a group of large prime order. Such a group
$G_q$ can be constructed by selecting two large prime numbers p and q such that
$p = 2q + 1$ and defining $G_q$ as the unique subgroup of order q in $\mathbb{Z}_p^*$. The
protocol works as follows. Initially, we assume that a dealer chooses a generator
g of $G_q$ and a random secret integer $x \in \mathbb{Z}_q$. The dealer then generates n shares
$x_1, \dots, x_n \in \mathbb{Z}_q$ using an f-threshold¹ Shamir's secret sharing scheme [18]. The
dealer secretly transmits the shares $x_i$ to their corresponding leaders and makes
public $h_i = g^{x_i}$ for all leaders $L_i$. We denote by $\tilde{g} = H(G)$ the output of a
hash function H applied to the most recent set of clients forming the group G.
In this scheme, the secret group key to be reconstructed by the clients is $\tilde{g}^x$.

¹ The secret cannot be reconstructed unless (f + 1) shares are available.

In addition to p, q and g, we assume that H is also known to all the participating
leaders. Given the above assumptions, the protocol works as follows:

1. Leader $L_i$ picks randomly $s \in \mathbb{Z}_q$ and computes $(a, b) = (g^s, \tilde{g}^s)$.

2. Leader $L_i$ then computes $c = H'(y_i, g, a, b)$, where $y_i = \tilde{g}^{x_i}$, and with
$H' : G_q^4 \to \mathbb{Z}_q$ a public hash function.

3. Now leader $L_i$ computes $r = s + c\,x_i$ and sends each client the quadruple
$(y_i, a, b, r)$, that is, the share $y_i$ and the proof of validity $(a, b, r)$.

4. Now the client computes $c' = H'(y_i, g, a, b)$, supposed to be equal to c, and
accepts the share $y_i$ only if the following equations hold:

    $g^r = a\,h_i^{c'}$    (1)

    $\tilde{g}^r = b\,y_i^{c'}$    (2)

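Let S be any set of f + 1 (or more) shares $y_i$ that a given client has received. For
simplicity, assume $S = \{y_1, y_2, \dots, y_{f+1}\}$. We denote by $(a_i)_{1 \le i \le f+1}$ the Lagrange
interpolation coefficients², such that $x = \sum_{i=1}^{f+1} a_i x_i$, where $a_i = \prod_{j \ne i} \frac{j}{j - i}$.

Given the above shares, the clients recover the secret group key as follows:

    $\tilde{g}^x = \tilde{g}^{\sum_{i=1}^{f+1} a_i x_i} = \prod_{i=1}^{f+1} (\tilde{g}^{x_i})^{a_i} = \prod_{i=1}^{f+1} y_i^{a_i}$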
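
The share generation, validity proof, verification, and key recovery steps can be traced end to end with toy numbers. The Python sketch below is a minimal illustration under stated assumptions, not the Enclaves implementation: the primes p = 23 and q = 11 are far too small for real use, SHA-256 reduced mod q stands in for the public hash H', the value chosen for $\tilde{g}$ is arbitrary, and leaders are assumed to hold the polynomial evaluations at points 1, ..., n.

```python
import hashlib
import secrets

# Toy parameters (assumption): p = 2q + 1 with p and q prime.
P, Q = 23, 11
G = 2         # generator of the order-q subgroup of Z_p^*
G_TILDE = 3   # stands in for H(group); another element of order q


def hash_to_zq(*elems):
    """Stand-in for the public hash H' (SHA-256 reduced mod q)."""
    data = ",".join(str(e) for e in elems).encode()
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % Q


def deal(x, n, f):
    """Shamir shares x_i = poly(i) mod q, where poly has degree f and poly(0) = x."""
    coeffs = [x] + [secrets.randbelow(Q) for _ in range(f)]
    return {i: sum(c * pow(i, k, Q) for k, c in enumerate(coeffs)) % Q
            for i in range(1, n + 1)}


def prove(x_i):
    """Leader side (steps 1-3): share y_i with validity proof (a, b, r)."""
    s = secrets.randbelow(Q)
    a, b = pow(G, s, P), pow(G_TILDE, s, P)
    y_i = pow(G_TILDE, x_i, P)
    c = hash_to_zq(y_i, G, a, b)
    r = (s + c * x_i) % Q
    return y_i, a, b, r


def verify(h_i, y_i, a, b, r):
    """Client side (step 4): check equations (1) and (2)."""
    c = hash_to_zq(y_i, G, a, b)
    return (pow(G, r, P) == a * pow(h_i, c, P) % P and
            pow(G_TILDE, r, P) == b * pow(y_i, c, P) % P)


def reconstruct(shares):
    """Combine {leader index: y_i} from f + 1 leaders into the group key."""
    key = 1
    for i, y_i in shares.items():
        coeff = 1  # Lagrange coefficient a_i at 0, computed mod q
        for j in shares:
            if j != i:
                coeff = coeff * j * pow((j - i) % Q, -1, Q) % Q
        key = key * pow(y_i, coeff, P) % P
    return key
```

A typical use deals the secret with `deal(x, n, f)`, publishes `pow(G, x_i, P)` for each leader, and lets a client call `verify` on each received quadruple before passing any f + 1 accepted shares to `reconstruct`. With a modulus this small a forged share passes with probability about 1/q, so the sketch shows the algebra only, not the security level.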

6.1 Security Analysis: Manual Proof 

We sketch proofs of two key properties, namely, robustness and unpredictability. 



Theorem 4 (Robustness) In the random oracle model³, a dishonest leader
cannot forge, with a non-negligible probability, a valid proof for an invalid share.

Proof sketch: Let $y_i$ be the share provided by leader $L_i$ and $(a, b, r)$ be the
corresponding correctness proof. $y_i$, a, b and r should then satisfy the following
equations:

    $g^r = a\,h_i^c$    (3)

    $\tilde{g}^r = b\,y_i^c$    (4)

where $c = H'(y_i, a, b, g)$. Equation (3) yields $a \in G_q$, since $h_i^c$ and $g^r$ are both in
$G_q$ (closure of $G_q$ under multiplication). The latter implies that there exists $\gamma \in \mathbb{Z}_q$
such that $a = g^\gamma$. Equation (3) gives $g^r = g^\gamma g^{c x_i}$, which implies $r = \gamma + c\,x_i$.
Now equation (4) becomes:

    $\tilde{g}^r = \tilde{g}^{\gamma + c x_i} = b\,y_i^c$

    $\Rightarrow\ \tilde{g}^\gamma b^{-1} = (\tilde{g}^{-x_i} y_i)^c$

This yields two possible cases:

1. $y_i = \tilde{g}^{x_i}$. In this case, the share is correct, $b = \tilde{g}^\gamma$, and for all $c \in \mathbb{Z}_q$ the
verifier equations trivially hold.

2. $y_i \ne \tilde{g}^{x_i}$. In this case, we must have $c = \log_{(\tilde{g}^{-x_i} y_i)}(\tilde{g}^\gamma b^{-1})$.

Once the triplet $(y_i, a, b)$ is chosen, if $y_i$ is not a valid share, then there
exists a unique $c \in \mathbb{Z}_q$ that satisfies the verifier equations. In the random oracle
model, the hash function H' is assumed to be perfectly random. Therefore,
the probability that $H'(y_i, a, b, g)$ equals c, once $(y_i, a, b)$ is fixed, is $1/q$. On the
other hand, if the attacker performs an adaptively chosen message attack by
querying an oracle N times, the probability for the attacker to find a triplet
$(y_i, a, b)$ such that $c = H'(y_i, a, b, g)$ is $P_{success} = 1 - (1 - 1/q)^N \approx N/q$ for large
q and N. Now if k is the number of bits in the binary representation of q, then
$P_{success} \le N/2^{k-1}$. Since a computationally bounded leader can only try a
polynomial number of triplets, when k is large the probability of success is
negligible ($P_{success} \ll 1$). □

² The $a_i$ depend only on the leaders' indexes and hence are publicly known.

³ In this model, the hash function can be seen as an oracle producing a random value
at each query. If the same query is asked twice, an identical answer is given [19].

Theorem 5 (Unpredictability) An attacker that corrupts up to f leaders
cannot, with a non-negligible probability, learn the secret group key $\tilde{g}^x$.

This has been proved by Cachin et al. [19] and relies on both:

- The perfect cryptography assumption (i.e., conditional entropy is no greater
  than simple entropy)

    $\mathcal{S}(y_{i_{f+1}} \mid y_{i_1}, y_{i_2}, \dots, y_{i_j}) = \mathcal{S}(y_{i_{f+1}})$ for all $j \le f$

- The Computational Diffie-Hellman assumption [22], which states that there
  is no polynomial time probabilistic algorithm that computes $y_i = \tilde{g}^{x_i}$ given
  g, $\tilde{g}$, and $h_i = g^{x_i}$, with a non-negligible probability of success.

As a result, the knowledge of up to f shares does not help the attacker to
predict any extra valid shares. Therefore, the data to which an attacker might
have access is not sufficient to reconstruct the group key with a non-negligible
probability of success.



7 Related Work 

Much work has been done to formally verify fault-tolerance in distributed proto- 
cols. Some of these verifications deal with the Byzantine failure model [7], while 
others remain limited to the benign form [8]. A variety of automata formalisms 
has been adopted to specify such protocols. 

Castro and Liskov [7] specified their Byzantine fault-tolerant replication al- 
gorithm using the I/O automata of Tuttle and Lynch [6]. They have manually 
proved their algorithm’s safety, but not its liveness, using invariant assertions 
and simulation relations. This work, although similar to our Byzantine
agreement module, has never been mechanized in any theorem prover.

Kwiatkowska and Norman [9] analyzed the Asynchronous Binary Byzantine
Agreement [19] (based on a concept similar to our key management module)
using a combination of mechanical inductive proofs (for non-probabilistic
properties) and finite state checks (for probabilistic properties) plus one high-level manual
proof. Our approach, too, takes advantage of the ease of use and performance of
the different techniques mentioned earlier to prove the overall Enclaves protocol.

Timed automata were also used to model the fault-tolerant protocols PAXOS 
[11] and Ensemble [14]. The authors assume a partially synchronous network 
and support only benign failures. This bears some similarities with our Enclaves 
verification in the sense that we assume some bounds on timing, but unlike the 
work in [11,14] we are dealing with the more subtle Byzantine kind of failure. 

In [13], Archer et al. presented the formal verification of some distributed 
protocols using the Timed Automata Modeling Environment (TAME). TAME 
provides a set of theory templates to specify and prove I/O automata similar to 
those we use in our specification. 



8 Conclusion and Future Work 

This paper reports results on the formal verification of an intrusion-tolerant
protocol. We experimented with an adaptively chosen combination of techniques
based on the nature of the correctness arguments in each module of the protocol,
the environment assumptions and the ease of performing verification.

We believe we have achieved promising success in verifying a complex
protocol such as Enclaves. Nevertheless, our results could be improved further in
various aspects. For instance, the feasibility of model checking is always limited
to instances with a finite number of states, which may, in some cases, prevent
the discovery of security flaws in realistic implementations of the protocols. This
can be improved by the use of rank functions [2]. We believe that using rank
functions is a very efficient way to mechanically prove authentication properties
and we are considering it among our future work plans.

Thanks to the high level of expressiveness of the Timed-Automata formalism,
as well as the rich datatype package of PVS, we succeeded in formalizing the
Byzantine agreement module for any number of leaders, in a way that thoroughly
captures the many subtleties on which the correctness arguments of Enclaves
rely. We have proved that the protocol satisfies its requirements of Termination,
Integrity and Proper Agreement. Yet, we have not proved the consistency of group
membership when members leave the group. This is also among our future work.
Finally, one promising direction for further development would be to perform the
mathematical analysis mechanically in PVS. This requires the elaboration of
some general purpose theories (e.g., probabilities) not yet available in PVS. The
current specification can be further extended by widening the capabilities of
Byzantine faults and by introducing the joint cryptographic layers that have been
abstracted away. Also, an upper bound on agreement establishment
delays could be investigated further.

Abstract. We propose new, tractably (in some cases provably) efficient
algorithmic methods for exact (sound and complete) parameterized reasoning about
cache coherence protocols. For reasoning about general snoopy cache protocols, 
we introduce the guarded broadcast protocols model and show how an abstract 
history graph construction can be used to reason about safety properties for this 
framework. Although the worst case size of the abstract history graph can be ex- 
ponential in the size of the transition diagram of the given protocol, the actual 
size is small for standard cache protocols as is evidenced by our experimental 
results. The framework can handle all 8 of the cache protocols in [19] as well 
as their split-transaction versions. We next identify a framework called initial- 
ized broadcast protocols suitable for reasoning about invalidation-based snoopy 
cache protocols and show how to reduce reasoning about such systems with an 
arbitrary number of caches to a system with at most 7 caches. This yields a prov- 
ably polynomial time algorithm for the parameterized verification of invalidation 
based snoopy protocols. Our results apply to both safety and liveness properties. 
Finally, we present a methodology for reducing parameterized reasoning about 
directory based protocols to snoopy protocols, thus extending techniques
developed for verifying snoopy protocols to directory based ones, which are typically
much harder to reason about. We demonstrate this by reducing reasoning about a
directory based protocol suggested by German [17] to the ESI snoopy protocol, a
modification of the MSI snoopy protocol.



1 Introduction 

Cache protocols provide a vital buffer between the ever growing performance of pro- 
cessors and lagging memory speeds making them indispensable for applications such as 
shared memory multi-processors. Unfortunately, cache protocols are behaviorally com- 
plex. Ensuring their correct operation, in particular that they maintain the fundamental 
safety property of coherence so that different processes agree on their view of shared 
data items, can be subtle. The difficulty of the problem is often magnified as the number 
n of coordinating caches increases. Moreover, it is highly desirable that a cache protocol 
be correct independent of the magnitude of n. There is thus great practical as well as 
theoretical interest in uniform parameterized reasoning about systems comprised of n
homogeneous cache protocols so as to ensure correctness for systems of all sizes n.
This general problem is known in the literature as the Parameterized Model Checking 
Problem (PMCP). It is, in general, algorithmically undecidable, but of great practical 
importance, which has led to many heuristics and algorithms for particular cases. In this 
paper, we present new, tractably (in some cases provably) efficient algorithmic methods 
for exact parameterized reasoning about cache coherence protocols. 

First, for reasoning about general snoopy cache protocols, we introduce the guarded 
broadcast protocols model wherein processes coordinate using broadcast primitives plus 
boolean guards. A broadcast transmission corresponds to a cache putting a message on the 
bus; reception of such a message corresponds to snooping the bus and taking appropriate 
action. Boolean guards make it possible to model protocols (e.g., Illinois-MESI, Firefly, 
Dragon) that need to determine the presence or absence of the required memory block in 
other caches. We show how an abstract history graph construction can be used to reason 
about safety properties of guarded broadcasts. In the construction, a path x leading to 
global state s is represented as a tuple of the form $(a, A) \in S \times 2^S$, where S is the set of
local states of the given cache protocol, that reflects not merely the local states present 
in s but also takes into account the local transitions that were fired along x to get to 
s, viz., the history of s along x. The extra historical information, that our construction 
stores, permits us to reason about safety properties for an arbitrary number of caches in 
an exact fashion as opposed to the standard abstract graph construction [24] that only 
takes into account the set of local states present in s and is thus sound but not guaranteed 
complete. We establish a path correspondence between concrete computations of the 
original system and paths in the abstract graph which also allows us to automatically 
generate error traces once an erroneous ‘abstract state’ is detected. In the worst case, 
the size of the abstract graph may be exponential in the size of the state diagram of the 
given cache protocol, thus enabling us to reason about the more expressive framework 
of guarded broadcast for the same worst case time complexity as ordered broadcasts. In 
practice, however, the abstract graph tends to be small as is documented by our empirical 
results. 

Next we consider the PMCP for invalidation based snoopy protocols, viz., protocols 
that on a write operation invalidate the memory block being written to in all other caches 
of the given system [7]. We model such protocols using the new framework of initialized 
broadcast protocols. For this model, we consider the PMCP for formulae of the form
$\bigwedge_{i \ne j} A\,h(i, j)$ and $\bigwedge_{i \ne j} E\,h(i, j)$, where $h(i, j)$ is an LTL\X formula over a pair of
distinct processes. For such formulae, we show how to reduce reasoning for a system
with an arbitrary number of processes to systems with at most a cutoff (in fact 7) number
of processes. This yields a provably polynomial time algorithm (in the size of the state
diagram of a single cache unit) for reasoning about the PMCP for a broad class of linear 
time properties of invalidation-based protocols, not just safety. Also the use of cutoffs 
has the important advantage that the large system with, say 100 caches, is very much like 
the small system with 7 caches. This provides a simple reduction from n to 7 processes 
that automatically caters to error recovery. 

Finally, we consider the PMCP for directory-based protocols wherein information 
regarding cache states of individual memory blocks is stored in a centralized directory 
and all transactions regarding cache state lookup, invalidations, updates etc., take place
across a network. We use the observation that for most directory based protocols there
exists a snoopy protocol with exactly the same states [7] and running essentially the 
same protocol except that the implementation of each snoopy broadcast transition is 
broken up into several steps. Since the executions of steps corresponding to different 
snoopy broadcasts can interleave among themselves, it makes directory-based proto- 
cols behaviorally more complex and thus seemingly harder to verify than their snoopy 
counterparts. However, we demonstrate, using a directory based protocol suggested by 
German [17], that since all transactions are serviced via the centralized directory, it 
leads to a serialization of steps of snoopy broadcasts in a way that there is limited over- 
lap among steps corresponding to different snoopy broadcasts. We can then establish 
path correspondences between computation paths of directory based protocols and their 
snoopy counterparts thereby allowing us to reduce the PMCP for linear time properties 
from directory based protocols to snoopy ones. Thus techniques developed for reasoning 
about parameterized snoopy broadcasts can now be leveraged. As an example, we show 
how to reduce reasoning about this directory based protocol to the ESI snoopy protocol, 
a modification of the MSI protocol, which was verified using the abstract history graph 
construction in less than 0.01 secs. 

The rest of the paper is organized as follows. We begin by introducing the system 
model in section 2. In section 3, we present the abstract history graph construction 
for verifying safety properties of guarded broadcast protocols while cutoff results for 
initialized broadcast protocols are given in section 4. In section 5, we demonstrate, 
using the protocol suggested by German, how to reduce reasoning about directory based 
protocols to snoopy protocols. Applications and experimental results are given in section 
6 and we end with some concluding remarks in section 7. 



2 The System Model 

We consider systems of the form $U^n$, comprised of finite, but arbitrarily many, copies
of a process template U, executing asynchronously with interleaving semantics. The
template U is formally defined by the 4-tuple $U = (S, \Sigma, R, \mathbf{i})$, where S is a finite,
non-empty set of states; $\Sigma$ is a finite set of labels including the internal transition label $\tau$, and
broadcast send and receive labels of the form $l!!$ and $l??$, respectively; R is the transition
relation; and $\mathbf{i}$ is the initial state. Each transition of R is either an internal transition of the
form $a \xrightarrow{g:\tau} b$, a broadcast send of the form $a \xrightarrow{g:l!!} b$, or a broadcast receive of the form
$a \xrightarrow{g:l??} b$, where g is a boolean guard.

We assume that receives are deterministic, viz., for each label $l!!$ appearing in some
broadcast send and for each state a in S, there is a unique corresponding receive transition
on $l??$ out of a. The guard g labeling a transition tr of R is either the boolean expression
true, or the specialized conjunctive guard $\bigwedge(\mathbf{i})$, or the specialized disjunctive guard
$\bigvee \neg(\mathbf{i})$, where $\mathbf{i}$ is the initial state of U. We assume that the guard is true for receive
transitions. In practice, the above mentioned guards suffice for modeling cache coherence
protocols, as each cache only needs to know whether another cache has the memory block
it requires, expressed using the specialized disjunctive guard, or whether no other cache
has it, expressed using the specialized conjunctive guard.

To capture block replacement behavior, we also require that templates be
initializable.* This means that from each state a of a protocol, there is an unguarded, internal
transition of the form $a \xrightarrow{\tau} \mathbf{i}$. Such initializations model block replacement behavior,
where a cache is non-deterministically pushed into its invalid state, irrespective of the
current state of the block. For simplicity, re-initialization transitions and self-loop
receptions are not drawn in state transition diagrams of cache protocols (cf. [7]).

We now introduce the following frameworks (a) Initialized Broadcast Protocols for 
dealing with invalidation based snoopy protocols, and (b) Guarded Broadcast Protocols 
for dealing with general snoopy cache protocols, by specifying the types of broadcast 
transition allowed. The two frameworks are incomparable in that each framework can 
model a protocol that the other cannot. 

Initialized Broadcast Protocols. There are two major classes of snoopy cache protocols:
update based and invalidation based. In update based protocols, e.g., Dragon and Firefly,
whenever a shared location is written to by a processor, its value is updated in the caches
of all other processors holding that memory block without invalidating the block. In
contrast, with invalidation based protocols, e.g., MESI and Berkeley, on a write operation
the memory block being written to is invalidated in all other caches [7]. In this paper,
we model invalidation-based protocols using the framework of Initialized Broadcast
Protocols wherein each broadcast transition of U is either (a) an $\mathbf{i}$-flush: transition
$a \xrightarrow{l!!} b$ is called an $\mathbf{i}$-flush iff from each state c of U there is the (unique) matching
receive $c \xrightarrow{l??} \mathbf{i}$, or (b) an initialized broadcast: transition $tr = a \xrightarrow{l!!} b$ is an initialized
broadcast send transition provided that $a = \mathbf{i}$ and every matching reception transition
for tr is of the form $c \xrightarrow{l??} d$, where either both $c, d \ne \mathbf{i}$ or both $c, d = \mathbf{i}$.

Guarded Broadcasts. In Guarded Broadcasts, each broadcast transition tr is of either
of the two forms: (a) Flush: Given state a of U, transition $b \xrightarrow{l!!} c \in R$, where $c \ne \mathbf{i}$,
is called an a-flush transition provided that there exists the matching receive transition
$\mathbf{i} \xrightarrow{l??} \mathbf{i}$ in R and for each state $d \ne \mathbf{i}$ of U, there is a matching receive transition of
the form $d \xrightarrow{l??} a$ in R; a flush transition is an a-flush for some a. (b) Push: Transition
$a \xrightarrow{l!!} b$, where $b \ne \mathbf{i}$, is a push transition provided that there exist the matching receive
transitions $\mathbf{i} \xrightarrow{l??} \mathbf{i}$, $a \xrightarrow{l??} a$ and $b \xrightarrow{l??} b$ in R and for every path $c \xrightarrow{l??} d \xrightarrow{l??} e$, we have d = e.

In either framework, given U, the state transition diagram for $U^n = (S^n, \Sigma, R^n, \mathbf{i}^n)$,
the system with n copies of U, is based on interleaving semantics in the standard way.
We write $x.s \in U^n$ to mean that finite computation path x of $U^n$ ends in global state s.
For local state a, $num(a, s)$ denotes the number of copies of local state a in s.

The template U for a protocol, such as MSI (figure 1), is obtained from its state 
transition diagram through a simple abstraction, treating the behavior of the processors 
as purely nondeterministic. The transformation is straightforward, syntactic, and
mechanical, and amounts to relabeling the transitions of the given template to illustrate
the link between broadcast sends and their matching receives.

* Initializability is not needed for the results in section 3.1.

Fig. 1. The MSI Cache Coherence Protocol and its template



Safety Properties. For cache coherence protocols, we are typically interested in pairwise
reachability, viz., given a pair (a, b) of local states a and b of template U, deciding
whether for some n, there exists a reachable global state of $U^n$, with a process in each of
the local states a and b, viz., $U^n \models \bigvee_{i \ne j} EF(a_i \wedge b_j)$. For instance, in the case of the MSI
protocol, we are interested in showing that none of the pairs in the set {(M, M), (M, S)}
is pairwise reachable.
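
For a fixed n, pairwise reachability can be decided by plain explicit-state search, which gives a useful baseline before any parameterized argument. The Python sketch below does this for a textbook rendering of MSI; the transition table is an assumption made here (the template in Fig. 1 is not reproduced in the text), and the check covers only the instance sizes actually explored, not all n.

```python
# Assumed textbook MSI: each broadcast is (sender source, sender target,
# receive map applied to every other cache).
BROADCASTS = {
    "PrRd":   ("I", "S", {"I": "I", "S": "S", "M": "S"}),
    "PrWr_I": ("I", "M", {"I": "I", "S": "I", "M": "I"}),
    "PrWr_S": ("S", "M", {"I": "I", "S": "I", "M": "I"}),
}
INITIAL = "I"


def successors(state):
    """All global states reachable from `state` in one interleaved step."""
    n = len(state)
    for k in range(n):
        # Block replacement: unguarded internal transition back to I.
        if state[k] != INITIAL:
            yield state[:k] + (INITIAL,) + state[k + 1:]
        # Broadcast sends by process k; every other process takes the receive.
        for src, dst, recv in BROADCASTS.values():
            if state[k] == src:
                yield tuple(dst if j == k else recv[state[j]]
                            for j in range(n))


def reachable(n):
    """Set of global states of the n-cache system reachable from (I, ..., I)."""
    start = (INITIAL,) * n
    seen, frontier = {start}, [start]
    while frontier:
        s = frontier.pop()
        for t in successors(s):
            if t not in seen:
                seen.add(t)
                frontier.append(t)
    return seen


def pairwise_reachable(n, a, b):
    """True iff some reachable state has distinct processes in a and in b."""
    return any(s[i] == a and s[j] == b
               for s in reachable(n)
               for i in range(n) for j in range(n) if i != j)
```

The search grows as 3^n in the worst case, which is the cost the abstract history graph and cutoff results of the following sections are designed to avoid.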



3 Model Checking Guarded Broadcasts for Safety Properties 

In a split-transaction bus, each transaction is split into two independent sub-transactions: 
a request transaction and a response transaction. Other transactions (or sub-transactions) 
are allowed to interfere (interleave) between them so that the bus can be used while re- 
sponse to the original request is being generated. The advantage is a more effective 
utilization of the bus. To deal with the non-atomic nature of bus transactions, extra states 
called transient states are introduced in the state transition diagram of split-transaction 
based protocols to indicate outstanding bus requests. This however makes snoopy split 
transaction bus protocols harder to reason about than their ‘non-split’ counterparts. We 
now show how to reason about guarded broadcasts, which can model all snoopy pro- 
tocols in [19] and their split transaction bus versions, using an abstract history graph 
construction. 



3.1 Protocols without Conjunctive Guards 

In this section, we consider guarded broadcasts wherein template U does not have
conjunctive guards, but guards of the form true or $\bigvee \neg(\mathbf{i})$ are permitted. This allows us to
handle the MSI, MOESI, MESI (not the Illinois version which is handled in the next
section), Berkeley and N+1 protocols, and their split-transaction versions.

We motivate our technique with the help of an example. Consider the computation
$x = (I, I) \xrightarrow{PrRd} (I, S)$ of the system $U^2$, comprised of two caches running the MSI protocol.
We exploit the observation that we can pump up the multiplicity of each of the local
states I, S to be greater than or equal to any arbitrary number n, by firing the transition
$I \xrightarrow{PrRd!!} S$ successively n times as shown:

    $(\underbrace{I, \dots, I}_{2n}) \xrightarrow{PrRd_1} \cdots \xrightarrow{PrRd_n} (\underbrace{S, \dots, S}_{n}, \underbrace{I, \dots, I}_{n})$

On the other hand, consider the computation $y = (I, I) \xrightarrow{PrWr} (I, M)$. We cannot pump up
the multiplicity of local state M, because in order for that to happen, we need to fire the
transition $tr = I \xrightarrow{PrWr!!} M$ repeatedly. But a process firing tr, a flush transition, clobbers
every other process by forcing it into its initial state. Thus we can have at most one copy
of M in any global state.

Definition (representative). Given template $U = (S, \Sigma, R, \mathbf{i})$, and a finite computation
x.s of $U^n$, we define rep(x.s) to be the tuple $(a, A) \in S \times 2^S$, where, if no flush transition
was fired along x, then $a = \mathbf{i}$ and $A = \{s[j] \mid j \in [1 : n]\}$; and if $U_i$ is the process to last
fire a flush transition along x, then $s[i] = a$ and $A = \{s[j] \mid j \in [1 : n] \wedge j \ne i\}$.
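
The definition translates almost directly into code. The Python sketch below is a hypothetical helper written for illustration: it takes the final global state as a tuple and the 0-based index of the process that last fired a flush (or None if no flush was fired), which are encodings assumed here rather than taken from the paper.

```python
def rep(state, last_flusher=None, initial="I"):
    """Representative (a, A) of a computation ending in global state `state`.

    state        -- tuple of local states, one per process
    last_flusher -- index of the process that last fired a flush along the
                    computation, or None if no flush transition was fired
    initial      -- name of the initial local state
    """
    if last_flusher is None:
        # No flush fired: a is the initial state, A holds every local state.
        return initial, frozenset(state)
    # Otherwise a is the flusher's local state and A holds the other states.
    others = frozenset(s for j, s in enumerate(state) if j != last_flusher)
    return state[last_flusher], others
```

For the two MSI computations in the example above, the read path ending in (I, S) has representative (I, {I, S}), while the write path ending in (I, M), where the second cache fired the flush, has representative (M, {I}).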

Then the above discussion can be formalized as the following unbounded pumping
property implicitly shown in the proof of proposition 3.1. Let computation path $x.s \in U^n$
be such that rep(x.s) = (a, A). Then given a positive integer p, there exists $y.t \in U^m$,
for some m, such that rep(y.t) = (a, A'), where $A \subseteq A'$ and for each $a' \in A'$,
$num(a', t) \ge p$. Thus we can represent x.s by the tuple $(a, A') \in S \times 2^S$, representing
a formal state with (at least) one copy of a and arbitrarily many copies of each state in A'.
Given template U, we now define the abstract history graph, $\mathcal{A}_U = (\mathcal{S}_U, \mathcal{R}_U, (\mathbf{i}, \{\mathbf{i}\}))$,
as a transition diagram over tuples in $\mathcal{S}_U = S \times 2^S$ that captures the behaviour of a system
instance of arbitrary size. To define the transition relation $\mathcal{R}_U$, given a tuple (a, A) and
an internal or a broadcast send transition $tr = c \to d$, we introduce the notion of the
successor of (a, A) via tr as either the 1-successor, which covers the scenario when a
process in local state a, that (possibly) has multiplicity one, fires tr; or the 2-successor
of (a, A), covering the scenario when a process in one of the states in A, each of which
can be thought of as having arbitrarily large multiplicity, fires tr.

Definition (1-successor). Let $(a, A) \in S \times 2^S$ and let transition $tr = a \to b \in R$, labeled
by guard g, be enabled in (a, A), viz., if $g = \bigvee \neg(i)$, then $\exists a' \in A : a' \neq i$. Then
$succ_1((a, A), tr) = (b, B)$, where if tr is an internal transition then B = A, and if tr is
a broadcast send transition then $B = \{b' \mid \exists a' \in A : \exists\, a' \to b' \in R$ that is a matching
receive for tr$\}$.

Definition (2-successor). Let $(a, A) \in S \times 2^S$ and let transition $tr = b \to c \in R$, where
$b \in A$, be such that if tr is labeled by guard g then it is enabled in (a, A), viz., if
$g = \bigvee \neg(i)$, then for some $a' \in \{a\} \cup A$, $a' \neq i$. Then $succ_2((a, A), tr)$ is defined as
the tuple

- $(c, \{c'\} \cup \{i\})$ if tr is a c'-flush transition

- $(a, A \cup \{c\})$ if tr is an internal transition. Note that since we had arbitrarily many
copies of b to start with, even after firing internal transition tr we are guaranteed
arbitrarily many processes in local state b, which is therefore not excluded from the
second component of the resulting tuple.

- (d, B) if tr is a push broadcast transition, where $a \to d$ is the (unique) matching
receive for tr from a and $B = \{c\} \cup \{b' \mid \exists a' \in A : \exists\, a' \to b' \in R$ that is a matching
receive for tr$\}$. Since we have arbitrarily many copies of b, in B we include
the local state that results from firing the matching receive for tr from b, which by
definition of a push transition is b itself.

Fig. 2. The abstract history graph for the MSI Cache Coherence Protocol

Definition (Abstract History Graph). Given template $U = (S, \Sigma, R, i)$, the abstract
history graph of U is defined as $\mathcal{A}_U = (S_U, \mathcal{R}_U, (i, \{i\}))$, where $S_U = S \times 2^S$ and
$\mathcal{R}_U = \{((a, A), (b, B)) \mid (b, B) = succ_1((a, A), tr)$ or $(b, B) = succ_2((a, A), tr)$ for
some internal or broadcast send transition tr of U$\}$.
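
The two successor definitions translate fairly directly into a worklist exploration. The sketch below is only illustrative: the MSI encoding in `SENDS`, `RECV` and `FLUSH_TO` is a simplified textbook-style guess (no evictions), not the template behind figure 2, and every identifier is hypothetical.

```python
from collections import deque

INIT = "I"

# Send transitions: (source, target, label, kind, guarded).
# kind is "internal", "push" or "flush"; guarded=True stands for the
# disjunctive guard "some other process is not in the initial state".
SENDS = [
    ("I", "S", "PrRd", "push", False),
    ("I", "M", "PrWr", "flush", False),
    ("S", "M", "PrWr", "flush", False),
]
# Matching receives per label: local state -> local state (assumed encoding).
RECV = {
    "PrRd": {"I": "I", "S": "S", "M": "S"},
    "PrWr": {"I": "I", "S": "I", "M": "I"},
}
FLUSH_TO = {"PrWr": "I"}  # state c' that a c'-flush forces receivers into


def enabled(guarded, others):
    """An unguarded transition is always enabled; a guarded one needs a non-initial state."""
    return (not guarded) or any(s != INIT for s in others)


def succ1(a, A, tr):
    """Successor when the distinguished process (in state a) fires tr."""
    src, dst, label, kind, guarded = tr
    if src != a or not enabled(guarded, A):
        return None
    if kind == "internal":
        return dst, A
    return dst, frozenset(RECV[label][s] for s in A)


def succ2(a, A, tr):
    """Successor when a process in one of the states of A fires tr."""
    src, dst, label, kind, guarded = tr
    if src not in A or not enabled(guarded, A | {a}):
        return None
    if kind == "internal":
        return a, A | {dst}
    if kind == "flush":
        return dst, frozenset({FLUSH_TO[label], INIT})
    # Push broadcast: a takes its matching receive; the set gains dst and
    # the images of all its members under the matching receives.
    return RECV[label][a], frozenset({dst}) | frozenset(RECV[label][s] for s in A)


def reachable_tuples():
    """Breadth-first exploration of the abstract graph from (i, {i})."""
    start = (INIT, frozenset({INIT}))
    seen, queue = {start}, deque([start])
    while queue:
        a, A = queue.popleft()
        for tr in SENDS:
            for nxt in (succ1(a, A, tr), succ2(a, A, tr)):
                if nxt is not None and nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return seen
```

Under this assumed encoding the exploration visits five tuples, and M never appears in the set component of any of them, which mirrors the observation that M cannot be pumped.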

As an example, the abstract history graph for the MSI protocol is shown in figure 2.
Self loops are omitted for the sake of simplicity. For convenience, we have labeled each
transition of the graph by the label of the transition responsible for 'firing' it. We now
establish a 'path correspondence' between finite computations of $U^n$ and finite paths of
$\mathcal{A}_U$ starting at (i, {i}). Let (a, A) > (b, B) denote a = b and $B \subseteq A$.

Proposition 3.1 (Covering Projection). For any n and any finite path x.s in $U^n$, there
exists a finite path y.t in $\mathcal{A}_U$ starting at (i, {i}) such that t > rep(x.s).

The tuple t not only stores the set of local states present in s, but also the states that
could potentially be present in a global state of a system with sufficiently many copies
of U that results by firing (a stuttering of) the same sequence of transitions as were fired
along x to get to s. Thus t drags along some 'history' of computation x leading to s and
thereby stores more information than rep(x.s).

Proposition 3.2 (Lifting). Let x be a path of $\mathcal{A}_U$ starting at (i, {i}) and leading to
tuple (a, A) of $\mathcal{A}_U$. Then, given p > 0, there exists $y.t \in U^n$, for some n, such that
rep(y.t) = (a, A) and t has at least p copies of each state in A plus a copy of a.

Combining the previous two results, we have

Theorem 3.3 (Decidability Result). Pair $(a, b) \in S \times S$ is pairwise reachable iff there
exists a path in $\mathcal{A}_U$ starting at (i, {i}) to a tuple of the form (c, C) where either a = c
and $b \in C$; or b = c and $a \in C$; or $a \in C$ and $b \in C$.
Thus we have reduced the problem of pairwise reachability for a pair of local states of a
given template U to the problem of reachability in $\mathcal{A}_U$. In the worst case, the size of the
abstract graph is $O(|U| 2^{|S|})$; however, we need only consider the set of tuples reachable
from (i, {i}), which, in practice, is much smaller (cf. section 6).
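
Once the reachable tuples are known, the test of Theorem 3.3 is a single pass over them. The following sketch assumes the tuples are given as pairs `(c, C)` with `C` a frozenset; the sample set `MSI_TUPLES` is a hypothetical input for an MSI-like template, written out by hand for illustration.

```python
def pairwise_reachable(a, b, tuples):
    """Theorem 3.3 test over a set of reachable abstract tuples (c, C)."""
    return any(
        (a == c and b in C) or (b == c and a in C) or (a in C and b in C)
        for c, C in tuples
    )


# Hypothetical reachable set for an MSI-like template with initial state I.
MSI_TUPLES = {
    ("I", frozenset("I")),
    ("S", frozenset("I")),
    ("M", frozenset("I")),
    ("I", frozenset("IS")),
    ("S", frozenset("IS")),
}
```

With this input, the pairs (M, M) and (M, S) are reported as not pairwise reachable, while (S, S) is, which is the shape of answer a coherence check is after.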

Corollary 3.4. The pairwise reachability problem for a pair of local states of a given
template U can be solved in time $O(|U| 2^{|S|})$, where |U| is the size of template U as
measured by the number of states and transitions in U.

3.2 Adding the Specialized Conjunctive Guard 

To reason about systems wherein the templates are augmented with the specialized
conjunctive guard along with the assumption of initializability, we modify the abstract
history graph by adding to $\mathcal{A}_U$, for every tuple (a, A), a transition of the form $(a, A) \to (a', \{i\})$,
where either a' = a or $a' \in A$. Broadly speaking, the intuition behind the modification
is that we can make the specialized conjunctive guard of a process evaluate to true
starting at any global state by driving all the other processes into their respective initial
states by making use of the initializing internal transition. Then, path correspondences
as in section 3.1 can be shown and so pairwise reachability can be decided in time
$O(|U| 2^{|S|})$, where |U| is the size of U. Examples include the Illinois-MESI, Dragon and
Firefly protocols and their split-transaction versions.
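
The extra edges are easy to state in code. This is a minimal sketch under the assumption that abstract tuples are `(a, A)` pairs with `A` a frozenset and that the initial state is named "I"; in a full exploration these successors would simply be added to the worklist alongside the 1- and 2-successors.

```python
def reset_successors(a, A, init="I"):
    """Extra successors (a', {i}) of (a, A), where a' = a or a' is in A.

    Models driving every process except one into the initial state via the
    initializing internal transition.
    """
    return {(a2, frozenset({init})) for a2 in {a} | set(A)}
```

For example, `reset_successors("E", frozenset({"I", "S"}))` yields the three tuples whose first component is E, I or S and whose second component is the singleton containing only the initial state.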



4 Reasoning about Invalidation Based Protocols Using Cutoffs 

In this section, we consider the PMCP for formulae of the form $\bigwedge_{i \neq j} Ah(i, j)$ and
$\bigwedge_{i \neq j} Eh(i, j)$, where h(i, j) is an LTL\X formula over the local states of $U_i$ and $U_j$. We
show how to reduce reasoning about a system with an arbitrary number of processes
(caches) to a system with up to a cutoff (in fact 7) number of processes. This immediately
yields a polynomial time algorithm for the PMCP at hand. The use of cutoffs has several
advantages. First, the small system with a cutoff number of processes is identical to the
large system, but with fewer processes, and thus there is no need to construct,
for instance, an abstract graph that may have a complex, non-obvious structure. Secondly,
it automatically caters to error trace recovery. We later show how to reduce reasoning
about LTL\X properties from directory-based to snoopy protocols for which these results
can be leveraged.

We now present the cutoff result for properties of the form $\bigwedge_{i \neq j} Eh(i, j)$. Since
all processes in the systems we consider are copies of a single template U, they are
all isomorphic up to renaming. Therefore symmetry considerations dictate that $U^n \models Eh(1, 2)$
iff for each pair i, j, where $i \neq j$, $U^n \models Eh(i, j)$. We shall therefore concentrate
only on the formulae Ah(1, 2) and Eh(1, 2).

Proposition 4.1 (Cutoff Result for Finite Paths). For all n > 7, $U^n \models E_{fin} h(1, 2)$ iff
$U^7 \models E_{fin} h(1, 2)$, where $E_{fin}$ quantifies over finite paths only.

Proof Sketch. We present the main ideas behind the proof. The proof of the cutoff result
proceeds by establishing a stuttering path correspondence between $U^n$, where n > 7,
and $U^7$, viz., constructing a finite stuttering computation path y of $U^7$ corresponding to
a given finite path x of $U^n$ that preserves the local computation paths of processes $U_1$
and $U_2$, modulo stuttering, and vice versa.

($\Rightarrow$) Given a finite computation x of $U^n$, where n > 7, we show how to construct
a finite computation y of $U^7$ that preserves the local computations of processes
$U_1$ and $U_2$, modulo stuttering. Towards that end, we parse (the transitions of) x as
$x = N_0 I_0 \ldots I_m N_{m+1}$, where $I_i$ is the ith global transition to be executed along x that
results by firing either an i-flush or a transition labeled with $\bigwedge(i)$. Thus the $N_i$ are strings
of transitions whereas the $I_i$ are single transitions. The construction of y proceeds by
constructing for each subsequence $N_i I_i$ a corresponding subsequence $N'_i I'_i$ by projecting
onto the local subsequences of $N_i I_i$ of a set $P_i$ of process indices defined below.

In defining $P_i$, there are two main considerations: (a) every projected broadcast
receive has a matching send, and (b) the specialized disjunctive guard is true for every
projected local transition (the conjunctive guard $\bigwedge(i)$ is automatically true for all
projected transitions). Clearly, we need to project onto process indices 1 and 2, as we have
to preserve the local computation sequences of $U_1$ and $U_2$ modulo stuttering. Also, we
need to project onto indices $p_3$ and $p_4$ of the processes responsible for firing the solitary
global transitions in $I_{i-1}$ and $I_i$, respectively. Projection onto index $p_3$ ensures
'continuity' of the local computation of the process responsible for firing the global transition
constituting $I_{i-1}$, while projection onto index $p_4$ guarantees that every projected receive
transition in $I'_i$ has a matching send. Finally, let $N_i = x_{i'} \ldots x_{i'+l}$ and let a and b
be, respectively, the least and second least among all integers $c \in [0 : l]$ having the
property that $x_{i'+c}[p] \neq i$, for some $p \in [1 : n] \setminus (\{1, 2\} \cup \{p_3\} \cup \{p_4\})$. To ensure
that the specialized disjunctive guard is true for the projected transitions, we include
the indices $p_5$ and $p_6$ in $P_i$, where $x_{i'+a}[p_5] \neq i$ and $x_{i'+b}[p_6] \neq i$. Then, we let $P_i =
\{1, 2\} \cup \{p_3\} \cup \{p_4\} \cup \{p_5\} \cup \{p_6\}$. A seventh process with index $p_7$, say, is required to
ensure that in $N'_i$, every projected initialized broadcast receive transition has a matching
broadcast send. Since, by definition, an initialized broadcast send is fired only from the
initial state, we use this process, which we (try to) maintain in its initial state i, to fire
the required send transition and then 'recycle' it by firing the initializing internal
transition to make it transit back to i. The computation y then results by 'sewing' up the
subsequences $N'_i I'_i$ appropriately, in the same relative order as the original subsequences
$N_i I_i$ along x. Note that the sets $P_i$ may be different for different i; however, since all
processes in our system are isomorphic up to renaming, for each i, $U^7$ can mimic the
local sub-computations of $N'_i I'_i$.

($\Leftarrow$) The lifting part is simpler. Given a computation y of $U^7$, we can construct a
valid computation x of $U^n$, where n > 7, by letting processes $U_1, \ldots, U_7$ execute exactly
the same local computations as in y while the rest of the processes just stutter in their
initial states without executing any non-receive transition at all (all receives from i loop
back to i). □
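
The basic operation used throughout the proof sketch, projecting a computation onto a set of process indices and comparing the result modulo stuttering, can be illustrated in a few lines. The sketch assumes a path is a list of global states (tuples of local states) and uses 0-based indices; it shows only the projection step, not the choice of the index sets or the insertion of the seventh process.

```python
def project(path, indices):
    """Project a sequence of global states onto the given process indices,
    dropping consecutive repeats (stuttering steps)."""
    out = []
    for state in path:
        local = tuple(state[k] for k in indices)
        if not out or out[-1] != local:
            out.append(local)
    return out
```

Two computations agree on processes 1 and 2 modulo stuttering exactly when their projections onto those two indices, computed this way, are equal.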

The proof technique of proposition 4.1 extends to the case where we consider full
paths (and full paths under the assumption of unconditional fairness). We then have the
following.

Proposition 4.2 (Cutoff Result for Full Paths). For all n > 7, $U^n \models Eh(1, 2)$ iff
$U^7 \models Eh(1, 2)$, where h(i, j) is an LTL\X formula over processes $U_i$ and $U_j$.

As a corollary to propositions 4.1 and 4.2, we have the following.

Proposition 4.3 (Efficient Decidability Result). For initialized broadcast protocols, the
PMCP for formulae of the types $\bigwedge_{i \neq j} E_{fin} h(i, j)$, $\bigwedge_{i \neq j} Eh(i, j)$ and
$\bigwedge_{i \neq j} Ah(i, j)$ is decidable in polynomial time in the size of the template U specifying
the parameterized family.

5 Reducing PMCP for Directory Based to Snoopy Protocols 

In this section, we present a methodology for reducing the PMCP for (stuttering
insensitive) LTL\X properties for directory based to snoopy cache protocols, thereby enabling
us to leverage the techniques developed for snoopy protocols. We exploit the observation
that with most directory based protocols one can associate a snoopy protocol with
exactly the same local states [7] and executing essentially the same protocol, except that the
implementation of each snoopy broadcast transition is broken down into several smaller
steps that execute asynchronously. We call such transitions distributed broadcasts. The
interleaving of the steps of different distributed broadcasts makes directory based
protocols behaviorally more complex than their snoopy counterparts and thus seemingly
harder to reason about. However, the central directory can service only one distributed
broadcast at a time, and so in a given computation x of the system $U^n_{Directory}$, comprised
of n caches running the directory based protocol Directory, there is a unique
serial order on the way distributed broadcasts are serviced along x. This allows us to
construct a computation y of $U^n_{Snoop}$, where Snoop is the snoopy protocol corresponding
to Directory, by letting the snoopy broadcast transitions fire in the same linear order as
their distributed counterparts were serviced along x. This path correspondence allows
us to reduce reasoning about linear time properties from directory based to snoop based
protocols. We demonstrate our technique using a directory based protocol suggested by
German [17], which we denote by DIR.

Reasoning about the DIR Directory Based Protocol. In the DIR protocol, each cache 
is represented as a client process with the directory being represented as the Home 
process. The variables used in DIR are given below. 

type message = {empty, req_shared, req_exclusive, invalidate,
                invalidate_ack, grant_shared, grant_exclusive}
type cache_state = {invalid, shared, exclusive}
channel1, channel2_4, channel3 : array[1:n] of message
home_sharer_list, home_invalidate_list : array[1:n] of boolean
home_exclusive_granted : boolean
home_current_command : message
home_current_client : [1:n]
cache : array[1:n] of cache_state

Each client has three possible local states, viz., invalid, shared and exclusive,
represented by the variable cache_state. Communication between client[i], the process
representing the ith cache, and Home, the process representing the directory, takes
place via the following variables that are shared pairwise between client[i] and Home.

- channel1[i]: used by client[i] to request the memory block in the shared or the
exclusive state.

- channel2_4[i]: used by Home to send the invalidation message or grant (shared or
exclusive) access to the memory block requested by client[i].

- channel3[i]: used by client[i] to send acknowledgements for invalidation requests
by Home.

Clients cannot communicate amongst themselves. The transitions for Home and client
processes are given in the guarded command format in figures 3 and 4, respectively.
We abbreviate home_current_client, home_current_command, home_sharer_list,
home_invalidate_list and home_exclusive_granted as h_c_cl, h_c_co, h_sh_l,
h_in_l and heg, respectively, and the communication channels channel1, channel2_4
and channel3 as ch1, ch2_4 and ch3, respectively.

3: (h_c_co = empty ∧ ¬(ch1[cl] = empty))
     → h_c_co := ch1[cl]; ch1[cl] := empty; h_c_cl := cl;
       for i : [1 : n] do h_in_l[i] := h_sh_l[i] endfor;
4: ((h_c_co = req_shared ∧ heg ∨ h_c_co = req_exclusive) ∧ h_in_l[cl] ∧
    ch2_4[cl] = empty)
     → ch2_4[cl] := invalidate; h_in_l[cl] := false
5: (¬(h_c_co = empty) ∧ ch3[cl] = invalidate_ack)
     → h_sh_l[cl] := false; ch3[cl] := empty; heg := false;
9: h_c_co = req_shared ∧ ¬heg ∧ ch2_4[h_c_cl] = empty
     → h_sh_l[h_c_cl] := true; h_c_co := empty; ch2_4[h_c_cl] := grant_shared;
10: h_c_co = req_exclusive ∧ ∧_i (h_sh_l[i] = false) ∧ ch2_4[h_c_cl] = empty
     → h_sh_l[h_c_cl] := true; h_c_co := empty; heg := true;
       ch2_4[h_c_cl] := grant_exclusive;

Fig. 3. Transitions for Home (Directory)

1: (cache[cl] = invalid ∧ ch1[cl] = empty) → ch1[cl] := req_shared
2: ((cache[cl] = invalid ∨ cache[cl] = shared) ∧ ch1[cl] = empty)
     → ch1[cl] := req_exclusive
6: (ch2_4[cl] = invalidate ∧ ch3[cl] = empty)
     → ch2_4[cl] := empty; ch3[cl] := invalidate_ack; cache[cl] := invalid;
7: (ch2_4[cl] = grant_shared) → cache[cl] := shared; ch2_4[cl] := empty;
8: ch2_4[cl] = grant_exclusive → cache[cl] := exclusive; ch2_4[cl] := empty;

Fig. 4. Transitions for Client (Cache)
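
To get a feel for the protocol, the guarded commands can be run for a small, fixed number of clients. The sketch below is an explicit-state breadth-first search written from my reading of figures 3 and 4; the integer message codes, the 0-based client indices, the tuple layout of a state and the names `successors` and `find_violation` are all assumptions made for illustration. It checks a single fixed instance only and is not a substitute for the parameterized argument that follows.

```python
from collections import deque

EMPTY, REQ_S, REQ_E, INV, INV_ACK, GR_S, GR_E = range(7)   # message codes
INVALID, SHARED, EXCLUSIVE = range(3)                      # cache states


def put(t, i, v):
    """Copy of tuple t with entry i replaced by v."""
    return t[:i] + (v,) + t[i + 1:]


def successors(s, n):
    """Yield every state reachable from s by one numbered transition."""
    cache, ch1, ch2_4, ch3, sh_l, in_l, heg, co, cur = s
    for cl in range(n):
        if cache[cl] == INVALID and ch1[cl] == EMPTY:              # 1
            yield (cache, put(ch1, cl, REQ_S), ch2_4, ch3, sh_l, in_l, heg, co, cur)
        if cache[cl] in (INVALID, SHARED) and ch1[cl] == EMPTY:    # 2
            yield (cache, put(ch1, cl, REQ_E), ch2_4, ch3, sh_l, in_l, heg, co, cur)
        if co == EMPTY and ch1[cl] != EMPTY:                       # 3
            yield (cache, put(ch1, cl, EMPTY), ch2_4, ch3, sh_l, sh_l, heg, ch1[cl], cl)
        if (((co == REQ_S and heg) or co == REQ_E)
                and in_l[cl] and ch2_4[cl] == EMPTY):              # 4
            yield (cache, ch1, put(ch2_4, cl, INV), ch3, sh_l,
                   put(in_l, cl, False), heg, co, cur)
        if co != EMPTY and ch3[cl] == INV_ACK:                     # 5
            yield (cache, ch1, ch2_4, put(ch3, cl, EMPTY),
                   put(sh_l, cl, False), in_l, False, co, cur)
        if ch2_4[cl] == INV and ch3[cl] == EMPTY:                  # 6
            yield (put(cache, cl, INVALID), ch1, put(ch2_4, cl, EMPTY),
                   put(ch3, cl, INV_ACK), sh_l, in_l, heg, co, cur)
        if ch2_4[cl] == GR_S:                                      # 7
            yield (put(cache, cl, SHARED), ch1, put(ch2_4, cl, EMPTY),
                   ch3, sh_l, in_l, heg, co, cur)
        if ch2_4[cl] == GR_E:                                      # 8
            yield (put(cache, cl, EXCLUSIVE), ch1, put(ch2_4, cl, EMPTY),
                   ch3, sh_l, in_l, heg, co, cur)
    if co == REQ_S and not heg and ch2_4[cur] == EMPTY:            # 9
        yield (cache, ch1, put(ch2_4, cur, GR_S), ch3,
               put(sh_l, cur, True), in_l, heg, EMPTY, cur)
    if co == REQ_E and not any(sh_l) and ch2_4[cur] == EMPTY:      # 10
        yield (cache, ch1, put(ch2_4, cur, GR_E), ch3,
               put(sh_l, cur, True), in_l, True, EMPTY, cur)


def find_violation(n):
    """Search DIR with n clients; return a reachable state in which one cache
    is exclusive while another is not invalid, or None if there is none."""
    e, f = (EMPTY,) * n, (False,) * n
    start = ((INVALID,) * n, e, e, e, f, f, False, EMPTY, 0)
    seen, queue = {start}, deque([start])
    while queue:
        s = queue.popleft()
        cache = s[0]
        if any(cache[i] == EXCLUSIVE and cache[j] != INVALID
               for i in range(n) for j in range(n) if i != j):
            return s
        for t in successors(s, n):
            if t not in seen:
                seen.add(t)
                queue.append(t)
    return None
```

Calling `find_violation(2)` explores the two-client instance exhaustively. Raising n makes the state space grow quickly, which is precisely the motivation for the reduction to a snoopy protocol developed in this section.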

We now show how to reduce verification of DIR to that of the ESI snoopy protocol, 
defined below. 

The ESI Snoopy Cache Protocol. The template for the ESI protocol is defined as
$U = (\{I, S, E\}, \{PrRd, PrWr\}, R, I)$, where the transition relation R consists of the
broadcast send transition $I \xrightarrow{PrRd!!} S$ with the matching receives $E \xrightarrow{PrRd??} I$, $S \xrightarrow{PrRd??} S$
and $I \xrightarrow{PrRd??} I$; and the I-flush broadcast $I \xrightarrow{PrWr!!} E$. The symbols E, S and I denote,
respectively, the exclusive, shared and invalid states.
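
For a fixed number of caches, the ESI template can be explored directly. The sketch below takes the target of the I-flush to be the exclusive state E and encodes exactly the two sends and the PrRd receives listed above; the function names and the tuple representation of a global state are illustrative assumptions.

```python
from collections import deque

# Matching receives for PrRd, as listed in the template.
PRRD_RECV = {"E": "I", "S": "S", "I": "I"}


def esi_successors(state):
    """Yield the global states reachable by one broadcast."""
    for k, local in enumerate(state):
        if local != "I":
            continue  # both sends in the template start from I
        # PrRd!!: sender moves I -> S, every other cache takes its PrRd?? receive.
        yield tuple("S" if j == k else PRRD_RECV[s] for j, s in enumerate(state))
        # PrWr!!: sender moves I -> E and, being an I-flush, resets all others to I.
        yield tuple("E" if j == k else "I" for j in range(len(state)))


def esi_reachable(n):
    """All global states of n ESI caches reachable from the all-invalid state."""
    start = ("I",) * n
    seen, queue = {start}, deque([start])
    while queue:
        s = queue.popleft()
        for t in esi_successors(s):
            if t not in seen:
                seen.add(t)
                queue.append(t)
    return seen


def coherent(state):
    """No cache is exclusive while another cache holds a valid copy."""
    return all(not (x == "E" and y != "I")
               for i, x in enumerate(state)
               for j, y in enumerate(state) if i != j)
```

Checking `all(coherent(s) for s in esi_reachable(3))` confirms coherence for three caches only; the abstract history graph and the cutoff result are what extend such a check to every n.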

Establishing the Stuttering Path Correspondence. Let $U^n_{DIR}$ represent a system with n
clients running the directory based protocol DIR. We begin by showing how the variables
used in the DIR protocol impose a relative ordering on the execution of the transitions of
the protocol. For transitions (numbered) j, k, we say that j pre-empts k, denoted by jPk,
to denote the fact that along any global computation of $U^n_{DIR}$, between any two firings of k
(possibly by different clients), there must be at least one firing of j. We write (j + k)Pm
to mean that either jPm or kPm, and $j_0 P \ldots P j_k$ to mean that for all $l \in [1 : k]$,
$j_{l-1} P j_l$. For transition j and index $i \in [1 : n]$, we write $j_i$ to indicate that the execution
of transition j modifies the local variables of $U_i$, the process representing the ith client,
or the communication variables channel1[i], channel2_4[i] and channel3[i], shared
pairwise between $U_i$ and Home.
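
On a single recorded trace, the pre-emption relation can be checked mechanically. The sketch below assumes a trace is a sequence of transition numbers in firing order and takes the pre-empting side as a set, reading (j + k)Pm as "between any two firings of m, at least one of j or k fires", which is how the relation is used for (9 + 10)P3. A check on one trace can refute a claimed pre-emption but cannot establish it for all computations.

```python
def preempts(trace, j, k):
    """True iff between any two firings of k in trace some transition in j fires.

    trace -- sequence of transition numbers, in firing order
    j     -- set of transition numbers (a singleton for plain jPk)
    k     -- a transition number
    """
    pending_k = False  # True while a k has fired with no j since
    for t in trace:
        if t in j:
            pending_k = False
        elif t == k:
            if pending_k:
                return False
            pending_k = True
    return True
```

For instance, a trace in which 3 fires twice with neither 9 nor 10 in between makes `preempts(trace, {9, 10}, 3)` return False.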

We first show that (9 + 10)P3. Note that variable home_current_command must be
set to empty for transition 3 to be enabled, and that can be done only by firing transitions
9 or 10. Thus one of 9 or 10 has to be fired for 3 to be fired (except for the first time).
Also, every time 3 is fired it sets home_current_command to a non-empty value, thus
disabling itself, and so again one of 9 or 10 has to be fired for 3 to fire again. Similarly,
we may show that 3P(9 + 10) (via home_current_command) and, for $i \in [1 : n]$,
$4_i P 6_i$ (via channel2_4), $6_i P 5_i$ (via channel3) and $3P4_i$ (via home_invalidate_list).

Let x be a global computation of $U^n_{DIR}$. Since 3P(9 + 10)P3, we have the
crucial observation that along x, the firing of 3 alternates with the firing of either 9 or 10.
Note that firing transitions 9 or 10 sets the value of home_current_command to empty,
thus disabling transition 4. Thus along x, the firing of transition 4 is always sandwiched
between the firings of 3 and one of 9 or 10. Consider a firing of $4_j$ along x. Then the value
of home_current_command during the last firing of 3 along x is either req_exclusive
or req_shared. If the value is req_exclusive, then since $4_j$ has been fired,
after firing the last 3, home_invalidate_list[j] = true = home_sharer_list[j],
and thus transition 5 (the only transition to change the value of home_sharer_list[j]
to false) has to be fired for 10 to be enabled to fire again. Also, since $4_j P 6_j P 5_j$, we have
that the firing of transitions $4_j$, $6_j$ and $5_j$ is sandwiched between the firing of 3 and 10.
If the value of home_current_command is req_shared, then we can similarly show that
the firing of transitions $4_j$, $6_j$ and $5_j$ is again sandwiched between the firings of 3 and 9.
Note that the first scenario, viz., the firing of 3; followed by the firing of the triplet $4_j$,
$6_j$ and $5_j$ for appropriate indices $j \in [1 : n]$; followed by Home firing 10, corresponds
to the firing of the snoopy broadcast of ESI labeled with PrWr in a distributed fashion.
Analogously, the second scenario, viz., the firing of 3; followed by the firing of $4_j$, $6_j$
and $5_j$ for appropriate j; followed by Home firing 9, corresponds to the firing of the
snoopy broadcast of ESI labeled with PrRd.

The distributed versions of the snoopy broadcasts labeled with PrRd and PrWr are
denoted by d-PrRd and d-PrWr, respectively. Thus the firing of all except the first and
last steps, viz., 1, 2, 7 and 8, of each distributed broadcast is sandwiched between the
firing of transitions 3 and (9 + 10). We call these transitions, including transitions 3 and
(9 + 10), the body of the distributed transition. The crucial observation is that the bodies
of different distributed transitions do not overlap, as once 3 is executed by a process, one
of 9 or 10 has to be executed by the same process for 3 to be executed again, possibly
by a different process, to begin executing the body of another transition. Thus given
computation x of $U^n_{DIR}$, we can arrange all the distributed broadcast transitions fired
along x in a sequence $d\text{-}tr_0, d\text{-}tr_1, \ldots$ based on the order in which their bodies were
executed. We say that a distributed transition d-tr is fired by process $U_k$ of $U^n_{DIR}$ iff the
entry transition of d-tr sets the value of home_current_client to k. Let transition $d\text{-}tr_j$
be fired by process $U_{i_j}$ of $U^n_{DIR}$. Let y be the computation sequence of $U^n_{ESI}$ that results
by firing the snoopy broadcasts $tr_0, tr_1, \ldots$ in the order listed, with transition $tr_j$ being
fired by process $U_{i_j}$ of $U^n_{ESI}$. Conversely, given a computation path y of $U^n_{ESI}$, we can
construct a computation path x of $U^n_{DIR}$ by replacing the firing of each snoopy broadcast
$tr_j$ by process $U_{i_j}$ of $U^n_{ESI}$ by the firing of all steps of $d\text{-}tr_j$ successively back to back
by process $U_{i_j}$ of $U^n_{DIR}$. This establishes the desired path correspondence.

For the DIR protocol, we are required to verify that in any global state u of $U^n_{DIR}$,
$(u[1] \neq u[2] \wedge u[1] = exclusive) \Rightarrow u[2] = invalid$. Towards that end, it suffices
to check the following: $\forall n : U^n_{DIR} \models \neg EF(a_1 \wedge b_2)$, where $(a, b) \in \{(exclusive,
exclusive), (exclusive, shared)\}$, viz., none of the pairs (exclusive, exclusive),
(exclusive, shared) of the DIR protocol is pairwise reachable. The next result reduces
reasoning about pairwise reachability for the DIR to the ESI protocol.

Proposition 5.1 (Reduction for Safety). For $a, b \neq invalid$, $U^n_{DIR} \models EF(a_1 \wedge b_2)$ iff
$U^n_{ESI} \models EF(a_1 \wedge b_2)$.

Thus it suffices to check that none of the pairs (E, E), (E, S) is pairwise reachable 
for the ESI protocol. This took 0.01 secs using the abstract history graph technique, 
and 0.02 secs using the cutoff technique. 

The above technique of establishing stuttering path correspondences also works, in
general, for LTL\X formulae. In [4], it was shown that the property $A(G(channel1[1]
= request\_shared \Rightarrow F(channel2\_4[1] = grant\_shared)))$, viz., once a block is
requested in the shared state by a cache then it is eventually granted shared access, fails.
However, if we assume unconditional fairness, viz., every process fires infinitely often,
then the property holds. We now modify the ESI protocol by introducing the
intermediate local states rS and rE, standing for request-shared and request-exclusive,
respectively. Before executing a broadcast send to the exclusive (shared) state, we first
transit via an internal transition to rE (rS) and then fire the broadcast send labeled with
PrWr!! (PrRd!!) to transit to the exclusive (shared) state. Then the above liveness
property can be reduced to the PMCP for $A(G(rS \Rightarrow F(S)))$ for the modified ESI. This
property has a cutoff of 7 and was verified to hold under the assumption of unconditional
fairness in 0.02 secs.¹ Note that the property fails if we do not assume fairness. In that
case an error trace is automatically generated for the 7 process instance. No manual
effort as in [4] is required to validate the erroneous path in the abstraction, an advantage
of using cutoffs.

¹ Technically, we verify the LTL\X expressible assertion $fair \Rightarrow G(rS \Rightarrow F(S))$.
6 Applications and Experimental Results

We consider PMCP for all the snoop based cache protocols presented in [19] (MSI,
MESI, Illinois-MESI, MOESI, Berkeley, Synapse N+1, Dragon, Firefly) and the
split-transaction version of the MESI protocol. Using the abstract history graph, each of
the above protocols was verified in at most 0.01 secs. Although in the worst case the
number of reachable abstract states in the modified abstract history graph for template
$U = (S, \Sigma, R, i)$ could be as large as $|S| 2^{|S|}$, in practice it typically turns out to be much
smaller. For instance, in the MESI protocol, the number of reachable abstract states was
6, against a worst case possibility of $4 \times 2^4 = 64$ states. In conclusion, the abstract
history graph construction seems to work well in practice. In fact, it seems to work
even better than the polynomial time cutoff method, which too is very efficient, requiring
only a fraction of a second to verify each invalidation based protocol. This, however,
may be due to the fact that whereas the abstract history graph was built directly from
the description of the protocol using separately written code, for the cutoff method
we used SMV, possibly resulting in extra overheads from compilation of the protocol
specifications, building BDDs etc. The experiments were carried out on a machine with
a 797 MHz Intel Pentium III processor and 256 MB RAM.



              Abstract History Graph                   Cutoff Method
Protocol      # of Abstract States  user time (secs)   Total # of BDD Nodes  user time (secs)
MSI           5                     < 0.01             7913                  0.02
MESI          6                     < 0.01             8287                  0.02
Illinois      6                     < 0.01             7711                  0.02
MOESI         7                     < 0.01             10284                 0.04
N+1           5                     < 0.01             7913                  0.02
Berkeley      5                     < 0.01             7689                  0.03
Firefly       6                     < 0.01             NA                    NA
Dragon        8                     < 0.01             NA                    NA
Split MESI    82                    < 0.01             NA                    NA


7 Concluding Remarks 

The generally undecidable PMCP has received a good deal of attention in the literature. A
number of interesting proposals have been put forth, and successfully applied to certain
examples (e.g., [2,3,5,20]). Most of these works, however, suffer from the drawbacks of
being either only partially automated or being sound but not guaranteed complete. Much
human ingenuity may be required to develop, e.g., network invariants; the method may
not terminate; the complexity may be intractably high; and the underlying abstraction
may only be conservative, rather than exact.²

² However, for frameworks that handle specialized domains, sound and complete, fully automatic
and, in some cases, efficient decision procedures can be given ([9,10,13,15,23]).

Similar limitations apply to prior work on PMCP for cache protocols. Some concrete
examples of verification of cache protocols can be found in [6,22]. Pong and Dubois
[24] described general methods that were sound but not complete, as they were based on
conservative, inexact abstractions. In [16], it was shown that the PMCP for safety over
broadcast protocols [14] is decidable using the general backward reachability procedure 
of [1]. In [21], Maidl, using a proof tree based construction, shows decidability of the 
PMCP for a broad class of systems including broadcast protocols, but the decision 
procedure is not known to be primitive recursive. Moreover [14,16,21] do not report 
experimental results for cache protocols. In [8], Delzanno uses arithmetical constraints 
to model global states of systems with many identical caches. His method uses invariant 
checking via backward reachability analysis of [1] and provides a broad framework for 
reasoning about cache coherence protocols but his procedure does not terminate on some 
examples. More recently, a decision procedure based on a modification of the backward 
reachability algorithm that guarantees termination for all snoopy cache protocols has 
been given in [12]. However, the backward reachability algorithm of [1] that [8,12,16]
make use of, although general, suffers from the handicap that its running time is not
known to be primitive recursive. Furthermore, this technique
does not provide a way to generate error traces when a bug is detected. An elegant
cutoff method that can verify the DIR protocol was given in [23], but it was sound and
not complete and worked only for safety properties. Also in [4], a broad technique was
proposed for the verification of WS1S systems that can handle the DIR protocol as an
example, but again the resulting technique was sound but not complete.

In this paper, we made three distinct contributions to the parameterized model check- 
ing of cache coherence protocols. 

First, to reason about general snoopy broadcast protocols, we introduced the frame- 
work of Guarded Broadcast Protocols. It is both a generalization and a significant 
simplification of ordered broadcast protocols [11] which required identification of a 
pre-order on the set of local states of the protocol. The extra transient states found in 
split-transaction bus protocols prevent the imposition of the necessary pre-order. Our 
new guarded protocol framework eliminates the need to impose a pre-order on protocol 
states and thereby caters readily for split transactions. This framework is broadly appli- 
cable, handling safety properties, and catering for all 8 snoopy protocols in Handy [19], 
even in their split transaction formulations. 

Second, we presented the framework of Initialized Broadcast Protocols, establish- 
ing provably efficient reasoning about safety and liveness of invalidation based snoopy 
protocols. We showed that a system with an arbitrary number of caches could be reduced 
to a system with at most 7 caches. This yields a fully automatic and provably efficient 
polynomial time algorithm for verifying parameterized invalidation based snoopy cache 
protocols. Cutoffs have the added important advantage that the small system with 7
caches is a precise replica of the large system with n caches, up to size. This not only makes
the reduction simple but also caters automatically for error trace recovery, as there is an error
in a large system iff there is one in the system with the cutoff number of processes.

Third and last, we described a method for reducing parameterized reasoning about 
directory based protocols to reasoning about snoopy protocols. We have illustrated the 
method using the DIR directory based protocol as an example. We then leverage the 
above cutoff and abstract history graph techniques developed for snoopy protocols to 
reason about linear time properties of parameterized directory based protocols, which 
typically are much harder to reason about, in an exact fashion. 







E.A. Emerson and V. Kahlon 



References 

1. P. Abdulla, K. Cerans, B. Jonsson, Y. K. Tsay. General Decidability Theorems for Infinite 
State Systems. LICS. 1996. 

2. P. Abdulla, A. Boujjani, B. Jonsson and M. Nilsson. Handling global conditions in parame- 
terized systems verification. CAV 1999. 

3. P. Abdulla and B. Jonsson. On the existence of network invariants for verifying parameterized 
systems. In Correct System Design- Recent Insights and Advances, 1710, LNCS, pp. ISO- 
197, 1999. 

4. K. Baukus, Y. Lakhnech, K. Stahl. ParameterizedVerification of a Cache Coherence Protocols: 
Safety and Liveness, VMCAI 2002, LNCS 2294, pages 317-330. 

5. M.C. Browne, E.M. Clarke and O. Grumherg. Reasoning about Networks with Many Identical 
Einite State Processes. Information and Control, 81(1), pages 13-31, April 1989. 

6. E.M. Clarke, O. Grumberg, H. Hirashi, S. Jha, D. E. Long, K. L. McMillan and L. A. Ness. 
Verification of the Euturehus-tcache coherence protocol. In Proc. Ilth Int. Symp. on Computer 
Hardware Description Languages and their Applications, 1993. 

7. D. E. Culler and J. P. Singh. Parallel Computer Architecture: A Hardware/Software Approach. 
Morgan Kaufmann Publishers, 1998. 

8. G. Delzanno. Automatic Verification of Parameterized Cache Coherence Protocols. CAV 
2000, pages 51-68. 

9. E.A. Emerson and V. Kahlon. Reducing Model Checking of the Many to the Few. CADE 
2000. 

10. E.A. Emerson and V. Kahlon. Model Checking Large-Scale and Parameterized Resource 
Allocation Systems. TACAS 2002. 

11. E.A. Emerson and V. Kahlon. Rapid Parameterized Model Checking of Snoopy Cache Pro- 
tocols. TACAS 2003. 

12. E.A. Emerson and V. Kahlon. Model Checking Guarded Protocols. LICS 2003. 

13. E.A. Emerson and K.S. Namjoshi. Reasoning about Rings. POPL. pages 85-94, 1995. 

14. E.A. Emerson and K.S. Namjoshi. On Model Checking for Non-Deterministic Infinite-State 
Systems. LICS 1998. 

15. E.A. Emerson and K.S. Namjoshi. Automatic Verification of Parameterized Synchronous 
Systems. CAV, LNCS, Springer-Verlag, 1996. 

16. J. Esparza, A. Finkel and R. Mayr. On the Verification of Broadcast Protocols. LICS 1999. 

17. S.M. German. Private communication. 

18. S.M. German and A.P. Sistla. Reasoning about Systems with Many Processes. J. ACM, 39(3), 
July 1992. 

19. J. Handy. The Cache Memory Book. Academic Press, 1993. 

20. R. P. Kurshan and K. L. McMillan. A Structural Induction Theorem for Processes. PODC. 
pages 239-247, 1989. 

21. M. Maidl. A Unifying Model Checking Approach for Safety Properties of Parameterized 
Systems. CAV 2001. 

22. K. McMillan and J. Schwalbe. Formal Verification of the Gigamax Cache Consistency 
Protocol. In Proc. Int. Symp. on Shared Memory Multiprocessors, pp. 242-251, 1991. 

23. A. Pnueli, S. Ruah and L. Zuck. Automatic Deductive Verification with Invisible Invariants. 
TACAS 2001, LNCS, 2001. 

24. F. Pong and M. Dubois. A New Approach for the Verification of Cache Coherence Protocols. 
IEEE Transactions on Parallel and Distributed Systems, Vol. 6, No. 8, August 1995. 




Design and Implementation of an Abstract 
Interpreter for VHDL 



Charles Hymans 

STIX, Ecole Polytechnique, 91128 Palaiseau, France 
Charles.hymans@polytechnique.fr 



Abstract. We describe the design by abstract interpretation of a static 
analysis for the popular hardware language VHDL. From a VHDL de- 
scription, the analysis computes a superset of the states reachable during 
any simulation run. This information is useful in the validation of safety 
properties of hardware components. The construction of the analysis is 
based on the formal definition of a semantics for VHDL. Soundness with 
respect to this semantics is shown. Various techniques allow a compro- 
mise between the desired accuracy and the cost of the final algorithm. We 
present a few examples and detail the essential implementation choices. 



1 Introduction 

We present the design of a static analysis for VHDL. It computes a superset 
of the states that may be encountered during any simulation run of a descrip- 
tion. Following the methodology of abstract interpretation [2], we first define 
the semantics of a subset of VHDL. A sound static analysis is then obtained 
from this formalization by abstraction. We make our construction generic in the 
underlying symbolic domain used to represent the possible values that signals 
may take. That way, it is possible to plug in various back-ends so as to attain 
the best compromise between precision and efficiency. This work extends [5]. 
Arrays, variables, for-loops and until clauses in wait statements were previously 
not considered. A finer abstraction of the state-space, which keeps track of the 
history of computation, is proposed. All implementation details are new. 



Motivating example. We consider a component which performs the multi- 
plication of an input matrix by a constant matrix. The input matrix is fed one 
coefficient at a time through a wire DI on rising edges of the clock CLK. New 
coefficients are signaled by setting a flag DSI high and need not be given in con- 
secutive cycles. Similarly, the result is produced on DO while the flag DSO is set. 
We write a test-bench made up of the input generator of Fig. 1 and the checker of 
Fig. 2. The generator stimulates the design to do the multiplication of a unique 
matrix INPUT. It does this an unbounded number of times and waits arbitrarily 
long between each coefficient. The checker simply asserts the values read on DO 
when DSO is high are the correct results of the multiplication. Our prototype 
implementation is able to determine, without any human intervention, that the 
assertion DO = RESULT(J) is never broken. Note that this is not practicable by 
conventional simulation. 

    initial INPUT := (1, 1, 0, 1);
    process
      for I in 0 to 3 loop
        wait on CLK until CLK;
        DSI <= FALSE;
        while random loop
          wait on CLK until CLK;
        end loop;
        DSI <= TRUE; DI <= INPUT(I);
      end loop;
    end process;

Fig. 1. Input driver 

    initial RESULT := (-4, 17, -9, 10);
    process
      for J in 0 to 3 loop
        wait on CLK until CLK;
        while (not DSO) loop
          wait on CLK until CLK;
        end loop;
        assert DO = RESULT(J);
      end loop;
    end process;

Fig. 2. Output checker 

    M    ::= P1 || ... || Pn              (Parallel composition)
    P    ::= C; P | ε                     (Sequence)
    C    ::= v := e                       (Variable assignment)
           | s <= e                       (Signal assignment)
           | a(e1) := e2                  (Array assignment)
           | wait on W until b for t      (Suspension)
           | while b do P end             (Iteration)
           | if b then P end              (Selection)

    e, b ::= i | true | false | random | v | s | a(e)
           | not b | b1 and b2 | b1 or b2 | e1 = e2 | e1 < e2
           | e1 + e2 | e1 * e2

where v is a variable, s a signal and a an array identifier; W is a possibly empty set of 
signals; t is a strictly positive integer or ∞; and i is an integer. 

Fig. 3. Syntax 



2 An Operational Semantics for VHDL 

To be able to reason about VHDL descriptions, we first formally define their 
semantics. Formalizations close to ours can be found in [3,4,6]. We suppose an 
elaboration phase - similar to the one presented in the standard [1] - compiles 
the description into a program of the kernel language of Fig. 3. Programs manip- 
ulate integers, booleans and statically allocated arrays. Note we deliberately ban 
delayed signal assignments (signal assignments with an after clause). They do 
not appear in the designs we wish to validate, and add much complexity since, 
in their presence, the precise layout of the memory used by a program is not 
known statically. 

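We express the execution of a program P as a small-step operational semantics. 
Program statements ${}^{l}C$ are uniquely tagged with labels $l$ that are taken 
from a set $\mathcal{L}$. The label of the unique statement which follows in the 
control flow graph of the enclosing process is fetched with $\mathit{next}(l)$. The point of 
execution in a process is determined by the label of the statement that is to be 
executed next. The control point of a suspended process is augmented with a 
list of signals $W$, a condition $b$ and a duration $t$. The duration is either a strictly 
positive integer or $\infty$ to indicate the absence of a timeout. A global environment 
$\rho$ stores values of variables and signals. We denote by $\bar{x}$ the location where the 
future value of a signal $x$ lies. We impose the syntactic restriction that no signal 
is assigned by more than one process. Hence, it is sufficient to remember only 
one future value for every signal. 

$$\textsf{var}\ \frac{{}^{l}\,v := e \qquad \rho \vdash e \Rightarrow w}{(l,\rho) \rightarrow (\mathit{next}(l),\ \rho[v \leftarrow w])} \qquad \textsf{sig}\ \frac{{}^{l}\,s \mathrel{<=} e \qquad \rho \vdash e \Rightarrow w}{(l,\rho) \rightarrow (\mathit{next}(l),\ \rho[\bar{s} \leftarrow w])}$$

$$\textsf{suspend}\ \frac{{}^{l}\,\texttt{wait on } W \texttt{ until } b \texttt{ for } t \qquad c = (\mathit{next}(l), W, b, t)}{(l,\rho) \rightarrow (c,\rho)}$$

$$\textsf{enter}\ \frac{{}^{l}\,\texttt{while } b \texttt{ do } {}^{l'}C; P \texttt{ end} \qquad \rho \vdash b \Rightarrow \mathit{true}}{(l,\rho) \rightarrow (l',\rho)} \qquad \textsf{exit}\ \frac{{}^{l}\,\texttt{while } b \texttt{ do } P \texttt{ end} \qquad \rho \vdash b \Rightarrow \mathit{false}}{(l,\rho) \rightarrow (\mathit{next}(l),\rho)}$$

Fig. 4. Sequential execution 

$$\frac{(c_i,\rho) \rightarrow (c_i',\rho') \qquad \forall j \neq i : c_j' = c_j}{(c,\rho) \rightarrow (c',\rho')}$$

$$\Delta\ \frac{\forall i : c_i = (l_i, W_i, b_i, t_i) \qquad \rho' = \mathit{update}(\rho) \qquad \exists j : \mathit{wake}(W_j, b_j, \rho, \rho') \qquad c_i' = \begin{cases} l_i & \text{if } \mathit{wake}(W_i, b_i, \rho, \rho') \\ c_i & \text{otherwise} \end{cases}}{(c,\rho) \rightarrow (c',\rho')}$$

$$\frac{\forall i : c_i = (l_i, W_i, b_i, t_i) \qquad \rho' = \mathit{update}(\rho) \qquad \forall j : \neg\mathit{wake}(W_j, b_j, \rho, \rho') \qquad t = \min_i t_i \qquad c_i' = \begin{cases} l_i & \text{if } t_i = t \\ (l_i, W_i, b_i, t_i - t) & \text{if } t_i \neq \infty \\ c_i & \text{otherwise} \end{cases}}{(c,\rho) \rightarrow (c',\rho')}$$

Fig. 5. Simulation algorithm 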

An expression $e$ evaluates to a value $v$ in an environment $\rho$, which we express 
by the judgment $\rho \vdash e \Rightarrow v$. The meaning of expressions is defined by structural 
induction in the classical way. Figure 4 shows the sequential execution of an 
individual process. Paraphrasing the sig rule: the right-hand side expression is 
evaluated in the current environment; the resulting value is then scheduled for 
the next cycle at the future-value location $\bar{s}$ of the assigned signal $s$; and control 
is transferred to the next statement. 
The three rules of Fig. 5 are enough to completely characterize the simulation 
algorithm of VHDL. Processes are run concurrently as long as possible thanks 
to the first rule. Once all processes are suspended, the global environment is 
updated so that signal assignments encountered during the last simulation cycle 
take effect: 

$$\mathit{update}(\rho)(x) = \begin{cases} \rho(\bar{x}) & \text{if } x \text{ is a signal,} \\ \rho(x) & \text{otherwise.} \end{cases}$$

The $\Delta$ rule reactivates any process for which the value of some signal in the 
sensitivity list $W$ was changed during the last cycle, and the condition $b$ is met: 

$$\mathit{wake}(W, b, \rho, \rho') = (\exists x \in W : \rho(x) \neq \rho'(x)) \wedge (\rho' \vdash b \Rightarrow \mathit{true}).$$

Finally, if no process activity can be resumed by $\Delta$ then the final rule advances 
simulation time by the smallest timeout. 
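
The update and wake definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the environment is modelled as a dictionary, the future value of a signal s is stored under the hypothetical key ("next", s), and the condition b is passed as a Python predicate over environments. None of these encodings come from the paper.

```python
def update(env, signals):
    """Copy each signal's future value into its current value (update in the text)."""
    new_env = dict(env)
    for s in signals:
        # ("next", s) is an assumed encoding of the future-value location of s.
        new_env[s] = env[("next", s)]
    return new_env


def wake(sensitivity, cond, env, new_env):
    """A suspended process resumes if a watched signal changed and its condition holds."""
    changed = any(env[x] != new_env[x] for x in sensitivity)
    return changed and cond(new_env)


# A process suspended on `wait on CLK until CLK` sees a rising clock edge.
env = {"CLK": False, ("next", "CLK"): True,
       "DSI": False, ("next", "DSI"): False}
env2 = update(env, ["CLK", "DSI"])
clk_wakes = wake(["CLK"], lambda e: e["CLK"], env, env2)
dsi_wakes = wake(["DSI"], lambda e: e["CLK"], env, env2)
```

Here the process sensitive to CLK is resumed, while a process sensitive only to DSI stays suspended because DSI did not change during the cycle.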



3 The Abstract Interpreter 

The set $O$ of all prefixes of execution traces from some initial state $s_0$ can be 
constructively expressed as the least fixpoint of the continuous operator $F$: 

$$F(X) = \{s_0\} \cup \{\, s_0 \ldots s_k s_{k+1} \mid s_0 \ldots s_k \in X \,\wedge\, s_k \rightarrow s_{k+1} \,\}.$$

This fixpoint is not effectively computable or even finitely representable. So 
we adopt the methodology of abstract interpretation [2] to obtain a decidable 
approximation. We proceed in two steps. 
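
To make the operator F concrete, the following Python sketch iterates it a bounded number of times on a toy transition system. The function names and the toy system (a counter modulo 3) are illustrative assumptions. Since the full fixpoint is infinite as soon as the system has a cycle, only finitely many iterates can be computed this way, which is exactly why an abstraction is needed.

```python
def F(X, s0, step):
    """One application of the trace-prefix operator: keep (s0) and extend every trace by one step."""
    return {(s0,)} | {t + (s,) for t in X for s in step(t[-1])}


def iterate(s0, step, n):
    """Compute the n-th Kleene iterate F^n(empty set)."""
    X = set()
    for _ in range(n):
        X = F(X, s0, step)
    return X


# Toy system (an assumption for illustration): a counter that cycles 0 -> 1 -> 2 -> 0.
step = lambda s: {(s + 1) % 3}
prefixes = iterate(0, step, 3)
```

After three iterations, prefixes holds the traces (0), (0, 1) and (0, 1, 2); each further iteration adds one longer prefix and the process never stabilizes.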



Generic Abstract Domain. We build an abstract domain to encode sets of 
traces. We collect environments and group them according to the history of 
computations that led to their creation. Collections of environments are 
further abstracted thanks to an abstract numerical domain $\mathcal{N}$. Numerical domains 
provide finite descriptions for sets of tuples of scalar values. We call $\gamma_{\mathcal{N}}$ the 
concretization function on the numerical domain. The way environments are 
grouped depends on a function $k$ which creates a token $h$ from an execution 
trace. Formally, a collection of abstract environments $X$ represents the traces: 

$$\gamma(X) = \{\, s_0 \ldots s_k \mid h = k(s_0 \ldots s_k) \wedge (c, \rho) = s_k \wedge R = X(c, h) \wedge \rho \in \gamma_{\mathcal{N}}(R) \,\}.$$

Both the numerical domain $\mathcal{N}$ and the grouping function $k$ are left as parameters 
of our construction. Hence, we have two orthogonal means to adjust the precision 
and efficiency of our analyzer. 

$$\begin{aligned}
[\![{}^{l}\, v := e]\!]^{\#} R &= \{(\mathit{next}(l),\ \mathrm{assign}_{v \leftarrow e}(R))\} \\
[\![{}^{l}\, s \mathrel{<=} e]\!]^{\#} R &= \{(\mathit{next}(l),\ \mathrm{assign}_{\#s \leftarrow e}(R))\} \\
[\![{}^{l}\, a(e_1) := e_2]\!]^{\#} R &= \{(\mathit{next}(l),\ \mathrm{assign}_{a(e_1) \leftarrow e_2}(R))\} \\
[\![{}^{l}\, \texttt{wait on } W \texttt{ until } b \texttt{ for } t]\!]^{\#} R &= \{(c, R) \mid c = (\mathit{next}(l), W, b, t)\} \\
[\![{}^{l}\, \texttt{while } b \texttt{ do } {}^{l'}C; P \texttt{ end}]\!]^{\#} R &= \{(l',\ \mathrm{select}_{b}(R)),\ (\mathit{next}(l),\ \mathrm{select}_{\mathrm{not}\ b}(R))\} \\
[\![{}^{l}\, \texttt{if } b \texttt{ then } {}^{l'}C; P \texttt{ end}]\!]^{\#} R &= \{(l',\ \mathrm{select}_{b}(R)),\ (\mathit{next}(l),\ \mathrm{select}_{\mathrm{not}\ b}(R))\}
\end{aligned}$$

where $\#s$ is a new expression to reference the future value of a signal: $\rho \vdash \#s \Rightarrow \rho(\bar{s})$. 
Fig. 6. Equations for the abstract sequential execution 

Fig. 7. Abstract simulation semantics 



Abstract Semantic Transformer. We systematically derive from its concrete 
counterpart an abstract simulation algorithm (see Fig. 6 and 7). The transition 
relation mimics in the abstract domain the concrete execution of processes. It 
is expressed in terms of a few primitives that operate on the numerical domain: 
assign undertakes assignments, select asserts boolean conditions, singleton builds 
the representation of a unique environment. Each of these operations must obey 
a soundness condition. For instance select must be such that: 

$$\{\, \rho \in \gamma_{\mathcal{N}}(R) \mid \rho \vdash b \Rightarrow \mathit{true} \,\} \subseteq \gamma_{\mathcal{N}}(\mathrm{select}_{b}(R)).$$

Finally, our algorithm consists in computing the least fixpoint of the following 
monotonic function: 

$$F^{\#}(X)(c', h') = X_0(c', h') \sqcup \bigsqcup \{\, R' \mid \exists (c, h) : R = X(c, h) \wedge (c, h, R) \rightarrow (c', h', R') \,\}.$$

The static analysis is correct. Indeed, thanks to the properties enforced on the 
basic numerical operators, one can prove that we have: 

$$O \subseteq \gamma(\mathrm{lfp}\, F^{\#}).$$



Implementation. We implemented the abstract interpreter in OCaml. 
Executions that went through distinct branches of if-statements are distinguished 
and for-loops are unrolled. For the back-end, we chose the domain of constants, 
which we encode with balanced binary trees. The major advantage is to improve 
sharing, which in turn speeds up many operations. All abstract environments 
computed during the analysis are placed in a hashtable. It is not necessary to 
keep them all in memory; rather, we store only the ones at the entry point of 
loops. Once the fixpoint has been reached, we can rebuild the missing 
environments in a single last pass. This dramatically reduces memory consumption. The 
fixpoint is computed with a standard worklist algorithm. The analysis was able 
to automatically verify various instances of the introductory example. 
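
As a rough illustration of a worklist fixpoint over the domain of constants, consider the Python sketch below. It is not the OCaml implementation described above: the control-flow graph encoding, the helper names (join_env, assign, analyze) and the toy program are assumptions, and history tokens, signals and the memory-saving strategy are all left out.

```python
TOP = "top"  # abstract value "not a known constant"


def join_val(a, b):
    """Join in the constants domain: equal constants stay, anything else becomes TOP."""
    return a if a == b else TOP


def join_env(e1, e2):
    """Pointwise join of abstract environments; None stands for 'unreachable'."""
    if e1 is None:
        return e2
    if e2 is None:
        return e1
    return {v: join_val(e1[v], e2[v]) for v in e1}


def assign(var, fn, *args):
    """Abstract transfer function for var := fn(args)."""
    def transfer(env):
        vals = [env[a] for a in args]
        out = dict(env)
        out[var] = TOP if TOP in vals else fn(*vals)
        return out
    return transfer


def analyze(edges, entry, init_env):
    """Standard worklist iteration: propagate until no abstract environment changes."""
    state = {entry: init_env}
    work = [entry]
    while work:
        node = work.pop()
        for src, dst, transfer in edges:
            if src != node:
                continue
            new = join_env(state.get(dst), transfer(state[node]))
            if new != state.get(dst):
                state[dst] = new
                work.append(dst)
    return state


# Toy program: x := 2; y := 4; while random do y := x * 2 end
identity = lambda env: env
edges = [
    (0, 1, assign("x", lambda: 2)),
    (1, 2, assign("y", lambda: 4)),
    (2, 3, assign("y", lambda x: x * 2, "x")),  # loop body
    (3, 2, identity),                           # back edge
    (2, 4, identity),                           # loop exit
]
result = analyze(edges, 0, {"x": TOP, "y": TOP})
```

In this example the loop body recomputes the same constant, so the analysis stabilizes with x = 2 and y = 4 at the loop exit (node 4).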

4 Conclusion 

We have shown the staged design of an abstract interpreter for a subset of VHDL. 
It is based on a formalization of the simulation algorithm. As such, it has the 
ability to handle non-synthesizable descriptions. This permits its early integra- 
tion in the design cycle. With a first implementation, we successfully verified 
non-trivial properties on a VHDL component. We hope to have demonstrated 
the adequacy of the approach as an automatic means to validate fairly complex 
safety properties. We were careful to separate concerns as much as possible so 
that our analyzer can be easily improved by local modifications. In fact, we can 
now focus on more efficient numerical domains tailored to prove specific classes 
of properties. We need no longer concern ourselves with the idiosyncrasies of the 
VHDL dialect. 



Acknowledgments. We are grateful to P. Cousot, R. Cousot, F. Logozzo, X. 
Rival and E. Upton for help, comments and discussions. 



Abstract. We outline a programming language based analysis of forwarding. 
Abstractions of processor behaviour are modelled as operational semantics for a 
language which captures the hardware resources for forwarding explicitly. Unsafe 
usage of the forwarding mechanism is eliminated by static semantics. These type 
systems may be linked to static program analysis frameworks but also characterise 
the instruction stream entering the datapath from other processor components. 



1 Introduction 

The forwarding (register-bypassing) of operands is a technique implemented in many 
modern microprocessors [7] [5]. The formal correctness of forwarding mechanisms such 
as Tomasulo’s algorithm [16], and their interaction with other elements of processor ar- 
chitecture have been widely studied [1][14] [10]. These verification efforts follow the 
well-known approach of relating processor implementations to the instruction set archi- 
tecture [4] [9], using model checking and theorem proving. In this paper, we present a 
more conceptual analysis of forwarding using an abstract model of computation com- 
prising named operand queues, registers and functional units. We demonstrate that using 
programming language notation yields an analysis which separates structural and imple- 
mentational aspects of forwarding. As a consequence, we may reason about constraints 
on the allocation of operand queues to operands which are imposed by functionality 
considerations without committing ourselves to a particular allocation algorithm. 

Elements of a Programming Language Based Approach. Our approach is program- 
ming language based as it builds on the methodology of modern programming language 
design, in particular on the separation into static and dynamic semantics. We use 

- syntax for reflecting architectural entities or the fine-grained structure of instructions. 
We present a language in which the forwarding resources are explicit: the names of 
operand queues appear in the syntax of instructions, much like those of registers. 

- structural operational semantics (SOS, [13]) for defining processor behaviour. By 
giving a language several operational semantics one may compare processor 
behaviour at various levels of abstraction. In this paper, we outline a processor model 
for sequential execution (similar to the ISA), while referring the reader to [3] for 
models for distributed (out-of-order) execution and execution with finite operand 
queues. 

- static semantics for formally expressing properties of the instruction stream which 
cannot easily be captured syntactically. In particular we employ type systems based 
on linear logic to characterise properties of the allocation of operand queues. 

As program properties (static semantics) and processor behaviour (dynamic semantics) 
are tied together by syntax, proof techniques which are guided by the syntactic structure 
may be used for reasoning about properties of instruction streams in a particular processor 
model. The allocation of operand queues is thus validated according to the slogan well 
typed programs can't go wrong: an instruction stream which has been accepted by the 
static semantics will not experience specific runtime hazards. The main such technique 
is structural induction, where the proof of a property of a certain phrase relies on related 
properties of syntactically constituent phrases. 

Static semantics also allows system properties to be related to compile-time analysis. 
A system designer may hence explore whether properties of the instruction stream can 
better be ensured by the hardware or the compiler. For example, the decision whether 
a value is forwarded is often made in the control unit. Based on the static semantics, 
we present an alternative analysis using dataflow equations. This allows us to study the 
amount of forwardable values in application programs, but also indicates the type of 
analysis a hardware implementation must perform in order to exploit these forwarding 
opportunities. 



Related Work. To our knowledge, no structural analysis of forwarding has been pub- 
lished yet, despite the plethora of verification exercises of processors with register by- 
passing. 

The application of flat term rewriting systems (TRS) for describing and relating pro- 
cessor models is advocated in [2]. This formalism captures the structure of the processor 
in a similar way as our approach, and aspects of our computational model may be seen as 
a refinement of [2]’s substitution-based communication of operands. However, the op- 
erational model is not complemented by static semantics, hence program and processor 
may not be treated in combination and no link to compile time analysis may be made. 
On the other hand, [8] reports how hardware implementations may be directly generated 
from the TRS descriptions. We have not attempted this task but are confident that one 
could develop a corresponding translation from SOS-based descriptions. 

Mountjoy et al. [11] present a SOS description of transport-triggered architectures 
where the structure of the semantics follows the structure of the architecture. A single 
dynamic semantics is given which models the (synchronous) execution of a family of 
move-instructions in two phases. The authors observe that the legality of code relies 
on the ability of the compiler to structure code in a way which avoids output-conflicts. 
While static semantics is mentioned as a means to enforce this property, no details are 
given in [11] and the topic was apparently not pursued any further. 

This paper represents a brief summary of the author's PhD thesis [3]. The reader is 
referred to [3] for a more in-depth presentation which includes formal proofs, a description 
of the experimental results, and more motivating discussion. 




L. Beringer 



2 Syntax and Operational Semantics 

We consider a simplified model where a processor core consists of a number of typed 
functional units which are located in parallel to each other and are fed by instruc- 
tion queues. Operands are communicated through registers or operand queues, and the 
syntax of our language treats both mechanisms identically. We have instructions like 
$\texttt{add}\ op_1\ op_2\ op_3$, $\texttt{dupl}^{fu}\ op_1\ op_2\ op_3$ and $\texttt{if}\ op_1\ n_1\ n_2$ where the $op_i$ denote 
registers or operand queues, $fu$ represents a functional unit and the $n_i$ denote program 
labels. 

Specific processor models are defined by giving dynamic semantics for the 
language. The sequential model of operation is defined by a relation $C \xrightarrow{t} D$ between 
configurations $C$ and $D$ and an instruction sequence $t$. Configurations consist of a 
register bank, a component describing the content of all operand queues (technically a 
map from operand queue identifiers to sequences of values), and a memory 
component. On arrival at a functional unit, an instruction awaits its operands in the queues 
or registers as indicated in its opcode. Whenever its functional unit becomes available, 
the instruction executes, which involves consuming operands from (and sending results 
to) registers and operand queues. The relation $C \xrightarrow{t} D$ is defined along the syntactic 
structure. Rules for individual instructions employ micro-instructions for read and write 
access to operand queues, registers and memory. For example, executing the sequence 
[1] $\texttt{ldc}\ 4\ q_1$ [2] $\texttt{dupl}^{fu}\ q_1\ q_1\ q_2$ [3] $\texttt{add}\ q_1\ q_2\ q_2$ in an initially empty state leads to $q_2$ 
containing the single value 8 and $q_1$ being empty. 
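
The three-instruction example can be replayed with a small Python sketch of operand queues. The operand order is an assumption inferred from the stated result (sources first, then destinations: dupl has one source and two destinations, add has two sources and one destination); registers, memory and functional-unit annotations are omitted, and the function name run is purely illustrative.

```python
from collections import deque


def run(program):
    """Execute a straight-line instruction sequence over FIFO operand queues."""
    queues = {}

    def read(q):
        # Reading consumes the head of the queue; an empty queue means the instruction is stuck.
        if not queues.get(q):
            raise RuntimeError(f"stuck: no operand available in {q}")
        return queues[q].popleft()

    def write(q, value):
        queues.setdefault(q, deque()).append(value)

    for op, *args in program:
        if op == "ldc":            # ldc const dst
            const, dst = args
            write(dst, const)
        elif op == "dupl":         # dupl src dst1 dst2
            src, dst1, dst2 = args
            value = read(src)
            write(dst1, value)
            write(dst2, value)
        elif op == "add":          # add src1 src2 dst
            src1, src2, dst = args
            write(dst, read(src1) + read(src2))
        else:
            raise ValueError(f"unknown instruction {op}")
    return {q: list(values) for q, values in queues.items()}


final_state = run([("ldc", 4, "q1"),
                   ("dupl", "q1", "q1", "q2"),
                   ("add", "q1", "q2", "q2")])
```

Under these assumptions the final state has q1 empty and q2 holding the single value 8, matching the description above. Running the add instruction alone raises the "stuck" error, which is the hazard the static semantics of the next section is designed to rule out.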

The dynamic semantics may be used to inspect the execution of programs by 
unfolding the derivation tree for judgements $C \xrightarrow{t} D$. Properties of the dynamic semantics 
such as determinism may be proven using structural induction. 



Alternative Dynamic Semantics. In addition to the sequential semantics, [3] defines 
a semantics for distributed (super-scalar) execution where instructions interleave. No 
assumptions are made regarding the delays inside functional units. We also consider 
operand queues of finite length. The relationships between these semantics correspond 
to the verification conditions in traditional processor verification, but may be proven by 
structural induction. For example, the distributed model subsumes in-order execution but 
admits additional interleavings, governed by the availability of operands in the operand 
queues. 

3 Static Semantics 

In general, each dynamic model of execution gives rise to specific run time hazards, 
i.e. conditions under which some programs do not execute correctly. Static semantics 
allows one to detect many classes of hazards syntactically. 

For the sequential model of operation, the typical hazard occurs if an instruction fails 
to execute due to the lack of operands in the appropriate operand queues. This condition 
depends on the initial configuration, but also on contextual instructions. In the case of a 
loop, such a deadlock may only become manifest after a number of iterations. 




A Programming Language Based Analysis of Operand Forwarding 



The type system we present employs a fragment of linear logic [6]. Referring the 
reader to [3] for formal details, we consider types which are linear products over the 
set of operand queues and registers, where registers are modelled by exponentials. 
We thus abstract from the particular values of operands and from the order of items 
in each queue. Our type system contains one axiom for each instruction form. To each 
instruction we associate a pair of types which relate the shape of configurations prior to 
the execution to the shape after the instruction has been executed. Typed instructions 
are composed to instruction sequences using a cut rule. Each straight-line sequence of 
code is again associated with a pair of pre- and post-types. At branch points, we require 
the net effect of each loop body on the number of elements in each queue to be neutral, 
similar to work by Stata and Abadi [15]. Operands expected by a loop body must be 
provided by earlier basic blocks and all operands created in the body must either be 
consumed immediately or be passed on to successor basic blocks. 

The inference of a typing derivation proceeds by weakening minimal typings of 
basic blocks until a unification of types at basic block boundaries is obtained. Failure of 
unification indicates the presence of a loop where each iteration consumes more values 
from the operand queues than it produces, or vice versa. 

The soundness of the type system guarantees that well-typed code will not get stuck 
due to insufficiently many operands. The proof of this result proceeds by structural 
induction; first single instructions are considered by proving the soundness of the axioms, 
then straight-line code is considered by proving the soundness of the cut rule, and finally 
full programs are considered using the rule for combining basic blocks. Thus, well-typed 
programs will either diverge or successfully complete, irrespective of the number 
of loop iterations. The size of intermediate configurations is statically bounded. 
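
The counting intuition behind the pre- and post-types can be sketched in Python with multisets of queue names. This is a simplification made for illustration only, not the linear-logic type system of [3]: registers (exponentials) are ignored, the per-instruction types are assumed from the operand order used in the example of Section 2, and the names instr_type, compose, type_of and is_neutral are hypothetical.

```python
from collections import Counter


def instr_type(instr):
    """Assumed axiom per instruction form: (operands consumed, operands produced)."""
    op, *args = instr
    if op == "ldc":                       # ldc const dst
        return Counter(), Counter([args[1]])
    if op == "dupl":                      # dupl src dst1 dst2
        return Counter([args[0]]), Counter(args[1:])
    if op == "add":                       # add src1 src2 dst
        return Counter(args[:2]), Counter([args[2]])
    raise ValueError(f"unknown instruction {op}")


def compose(t1, t2):
    """Cut-like composition: t2 first consumes what t1 produced, the rest is required up front."""
    (pre1, post1), (pre2, post2) = t1, t2
    pre = pre1 + (pre2 - post1)     # Counter subtraction drops non-positive counts
    post = post2 + (post1 - pre2)
    return pre, post


def type_of(sequence):
    """Minimal pre/post type of a straight-line instruction sequence."""
    typ = (Counter(), Counter())
    for instr in sequence:
        typ = compose(typ, instr_type(instr))
    return typ


def is_neutral(loop_body):
    """A loop body is acceptable only if it leaves the number of queued operands unchanged."""
    pre, post = type_of(loop_body)
    return pre == post


example = [("ldc", 4, "q1"), ("dupl", "q1", "q1", "q2"), ("add", "q1", "q2", "q2")]
example_type = type_of(example)
ok_body = is_neutral([("dupl", "q1", "q1", "q2"), ("add", "q1", "q2", "q1")])
bad_body = is_neutral([("add", "q1", "q2", "q2")])
```

The example sequence needs no operands initially and leaves one item in q2. The first loop body is neutral, while the second consumes two operands per iteration and produces one, so it would eventually get stuck and is rejected.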



Alternative Models of Execution. In [3] we generalise our analysis to the alternative 
dynamic semantics. The typical hazards for distributed execution are race conditions 
and functional non-determinism as the execution of each instruction is triggered purely 
by the presence of operands. In our approach, these hazards are seen as a joint property 
of program and processor. Instead of immediately introducing hardware mechanisms 
for synchronisation we employ static semantics to identify non-deterministic programs. 
We extend the type system to detect race conditions and consider various techniques for 
guaranteeing that the corresponding serialisation requirements are met. Indeed, many 
programs may be serialised without additional synchronisation hardware. 

A particular advantage of our analysis is observed for the model with operand queues 
of finite length. Here, the characteristic error condition consists of a deadlock due to an 
operand queue overflow. Our analysis shows that the absence of deadlock is preserved for 
deterministic programs when the length restrictions are relaxed, while for other programs 
this is in general not the case. 

4 Program Analysis 

The third aspect of a programming language based approach consists of the ability to 
formally relate low-level properties to program analysis frameworks [12]. We present a 
dataflow analysis for a labelled intermediate language for detecting when an intermediate 
value is used exactly once. These read-once values are candidates for forwarding, as their 
single usage corresponds to the deletion from the operand queue during a read access. 

The analysis targets the dynamic number of uses of an intermediate variable: any 
two assignments must be separated by exactly one read-access and no values should be 
left over at the end of a program run. We generalise the dataflow equations for liveness 
[12] by using a four-element lattice $\mathcal{L}$ and say that a pair of functions 
$\mathit{fwd}_{entry}, \mathit{fwd}_{exit} : \mathit{Lab}_P \rightarrow \mathit{Var}_P \rightarrow \mathcal{L}$ is a solution if 

$$\mathit{fwd}_{exit}(\ell)(x) = \begin{cases} 0 & \text{if } \ell \in \mathit{final}(P) \\ \bigsqcup_{(\ell,\ell') \in \mathit{flow}(P)} \mathit{fwd}_{entry}(\ell')(x) & \text{otherwise} \end{cases} \qquad (1)$$

$$\mathit{fwd}_{entry}(\ell)(x) = \begin{cases} \mathit{uses}(\ell)(x) & \text{if } x \in \mathit{kill}(\ell) \\ \mathit{uses}(\ell)(x) \oplus \mathit{fwd}_{exit}(\ell)(x) & \text{otherwise} \end{cases} \qquad (2)$$

where $\mathit{kill}$ and $\mathit{uses}$ are again generalisations of the corresponding functions in the 
analysis of liveness. The forwardability information of a solution is contained in the 
component $\mathit{fwd}_{exit}$. In [3] we show that a value assigned to a variable $x$ at a program 
point $\ell$ is read exactly once if $\mathit{fwd}_{exit}(\ell)(x) = 1$ holds. Variables $x$ for which 
$x \in \mathit{kill}(\ell)$ implies $\mathit{fwd}_{exit}(\ell)(x) = 1$ for all $\ell$ may thus be deleted after any read access. 
Notice the similarity to the characterisation of useless variables by liveness analysis. 
The proof of this characterisation formally relates (1) and (2) to a dynamic semantics of 
the intermediate language. 
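
Equations (1) and (2) can be solved by a simple fixpoint iteration, sketched below in Python. The text does not spell out the four lattice elements or the operator ⊕, so the sketch makes its own assumptions: a diamond lattice with bottom ("no information yet"), 0 ("never read"), 1 ("read exactly once") and top ("anything else"), a join that sends disagreeing values to top, and ⊕ as saturating addition. The program encoding and the function name solve are likewise illustrative.

```python
BOT, ZERO, ONE, TOP = "bot", 0, 1, "top"   # assumed four-element lattice


def join(a, b):
    """Least upper bound in the assumed diamond lattice."""
    if a == BOT:
        return b
    if b == BOT:
        return a
    return a if a == b else TOP


def oplus(a, b):
    """Assumed saturating addition of use counts: 0 is neutral, 1 + 1 overflows to TOP."""
    if BOT in (a, b):
        return BOT
    if a == ZERO:
        return b
    if b == ZERO:
        return a
    return TOP


def solve(labels, variables, flow, final, kill, uses):
    """Iterate equations (1) and (2) from bottom until nothing changes."""
    entry = {(l, x): BOT for l in labels for x in variables}
    exit_ = dict(entry)
    changed = True
    while changed:
        changed = False
        for l in labels:
            for x in variables:
                if l in final:
                    new_exit = ZERO                                  # equation (1), first case
                else:
                    new_exit = BOT
                    for src, dst in flow:
                        if src == l:
                            new_exit = join(new_exit, entry[(dst, x)])
                u = uses.get((l, x), ZERO)
                if x in kill.get(l, set()):
                    new_entry = u                                    # equation (2), first case
                else:
                    new_entry = oplus(u, new_exit)
                if (new_exit, new_entry) != (exit_[(l, x)], entry[(l, x)]):
                    exit_[(l, x)], entry[(l, x)] = new_exit, new_entry
                    changed = True
    return entry, exit_


# Toy program:  1: x := 4   2: y := x   3: z := y + y   4: print z
labels, variables = [1, 2, 3, 4], ["x", "y", "z"]
flow, final = [(1, 2), (2, 3), (3, 4)], {4}
kill = {1: {"x"}, 2: {"y"}, 3: {"z"}}
uses = {(2, "x"): ONE, (3, "y"): TOP, (4, "z"): ONE}
fwd_entry, fwd_exit = solve(labels, variables, flow, final, kill, uses)
```

For this toy program the values assigned to x at label 1 and to z at label 3 are read exactly once (fwd_exit is 1), so they are forwarding candidates, whereas y is read twice at label 3 and is therefore not.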



Compilation Based on Dataflow Solutions. Based on dataflow analysis, a compiler 
may convert intermediate programs into assembly code. The allocation of operand queues 
to read-once variables differs from register allocation as the order of writing must co- 
incide with the order of reading. In [3] we demonstrate how conflict graphs between 
read-once variables may be obtained similarly to conflict graphs for register allocation 
and prove the functional correctness of a translation which maps adjacent variables to 
different operand queues. We also show that the resulting code is well-typed and thus 
structurally correct with respect to the underlying hardware. The existence of weaken- 
ings for satisfying the typing condition at basic block boundaries is guaranteed by the 
dataflow solutions, and loops are of neutral net effect. Indeed, the typing judgements may 
be formally obtained from the dataflow solutions, eliminating the need for an assembly 
level type inference. 



Experimental Results. The dataflow analysis was implemented for two conversions 
of JVM code into the intermediate language and exercised on the Linpack benchmark 
suite. We observed that nearly all usage of the operand stack may be translated into 
forwarding if an SSA-like conversion scheme is used. Furthermore, the number of allo- 
cated registers decreased by up to 50%, even if each operand queue may only be used for 
operands sent to a specific functional unit. More significant than these static measures 
are dynamic measures: our analysis shows that on average 65% of the (central) register 
read operations turn into (local) operand queue reads, while the corresponding number 
for write operations is 62%. 

5 Discussion 

We presented an analysis of forwarding based on dynamic and static semantics of a lan- 
guage with explicit forwarding. We demonstrated the ability of programming language 
technology to eliminate important classes of error conditions (deadlocks and race con- 
ditions) and to analyse the forwarding potential of programs. Interpreting our language 
as the compiler-visible definition of a processor leads to a verification approach which 
emphasises that overall system correctness depends as much on program properties as on 
the correctness of processor implementations. On the other hand, it may be undesirable 
to expose operand queues explicitly to the programmer. Under this perspective, our anal- 
ysis demonstrates how a separation between functional and implementational aspects 
of forwarding may be achieved. Future work is needed to identify how the dataflow- 
based compilation may be related to hardware implementations. Although the technical 
results apply only to the specific model of computation considered in this paper, we thus 
argue that type systems and other syntax-directed formalisms provide a solid basis for 
structured reasoning about interactions between processor architecture and compilation. 



Acknowledgements. The author is grateful to Colin Stirling and Ian Stark for super- 



Abstract. We present a verification algorithm that can automatically 
switch from RAM based verification to disk based verification without 
discarding the work done during the RAM based verification phase. This 
avoids having to choose beforehand the proper verification algorithm. 
Our experimental results show that typically our integrated algorithm is 
as fast as (sometimes faster than) the fastest of the two base (i.e. RAM 
based and disk based) verification algorithms. 



1 Introduction 

Disk based verification algorithms [4, 5, 8, 3, 2] turn out to be very useful to coun- 
teract state explosion (i.e. the huge amount of memory required to complete state 
space exploration). However, using a disk based verification algorithm for a task 
that could have been completed just using a RAM based verification algorithm 
results in a waste of time. Unfortunately it is hard to predict beforehand the size 
of the set of reachable states so as to use the proper (RAM based or disk based) 
verification algorithm. 

In this paper we present an explicit verification algorithm that can automa- 
tically switch from RAM based verification to disk based verification without 
discarding the work done during the RAM based verification phase. This avoids 
having to choose beforehand the kind of verification algorithm, thus saving on 
the verification time. 

Our main contributions can be summarized as follows. 

- We present (Section 3) an integration scheme (we call it serialization scheme) 
for the RAM based verification algorithm presented in [9] and the disk based 
verification algorithm presented in [2]. 

- We present (Section 4) experimental results on using our serialization scheme 
implemented within the Murφ verifier. Our experimental results show that 

* This research has been partially supported by MURST projects: MEFISTO and 
SAHARA 
FIFO_Queue Q; HashTable T; 
bfs(init_states, next) { 
  foreach s in init_states Enqueue(Q, s); /* load Q with init states */ 
  foreach s in init_states Insert(T, s); /* mark init states as visited */ 
  while (Q is not empty) { s = Dequeue(Q); /* current state */ 
    foreach s’ in next(s) /* expand current state */ 
      if (s’ is not in T) { Insert(T, s’); Enqueue(Q, s’); } } } 



Fig. 1. Explicit Breadth First Visit (RAM based) 



typically our integrated algorithm is as fast as (sometimes faster than) the fastest 
of the two base (i.e. RAM based and disk based) verification algorithms. 
This means that on a single machine we are able to run two verification 
attempts (RAM based and then disk based) within the time taken by the 
first terminating verification attempt. 



2 State Space Exploration Algorithms 

Our goal is to devise a serialization scheme for the RAM based state exploration 
algorithm presented in [9] (CBF, for Cached Breadth First visit in the following) 
and the disk based state exploration algorithm presented in [2] (DBF, for Disk 
Breadth First visit in the following) . 

Figure 1 shows the algorithm and data structures used by a Breadth First 
(BF) visit. Both the Enqueue () operation on BF queue Q as well as the Insert () 
operation on the visited states hash table T in Figure 1 may fail because of lack 
of memory. In such cases the BF visit stops with an out of memory message. 
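
As a minimal sketch of the visit in Figure 1, the following Python version keeps both Q and T in RAM, using a deque and a set. The transition relation `toy_next` is a hypothetical example chosen only for illustration; it is not one of the benchmark protocols.

```python
from collections import deque

def bfs(init_states, next_states):
    """Explicit breadth-first visit; returns the set of reachable states."""
    queue = deque()   # BF queue Q
    visited = set()   # visited-states table T
    for s in init_states:
        if s not in visited:
            visited.add(s)       # mark init states as visited
            queue.append(s)      # load Q with init states
    while queue:
        s = queue.popleft()      # current state
        for t in next_states(s): # expand current state
            if t not in visited:
                visited.add(t)
                queue.append(t)
    return visited

# Hypothetical toy system: a counter modulo 8 that can advance by 1 or by 3.
def toy_next(s):
    return [(s + 1) % 8, (s + 3) % 8]

reachable = bfs([0], toy_next)
```

Both `visited` and `queue` grow with the number of reachable states, which is why the RAM based visit can run out of memory.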

Algorithm CBF [9] implements BF queue Q on disk and, most importantly, 
replaces with a cache table the hash table T used by the standard BF visit in 
Figure 1. Using a cache table rather than a hash table means that, upon a 
collision, CBF may forget visited states and, as a result, it may revisit states. 
To prevent nontermination due to revisiting states, CBF terminates when the 
collision rate (i.e. the ratio between the number of collisions and the number of 
insertions) is above a user given threshold. 
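
The effect of replacing the hash table with a cache table can be sketched as follows. This is only an illustration under stated assumptions: the direct-mapped `CacheTable` class, the use of Python's built-in `hash`, and the in-memory queue are all hypothetical simplifications (CBF as described keeps Q on disk), and the details of [9] may differ.

```python
from collections import deque

class CacheTable:
    """Direct-mapped cache: a colliding insertion overwrites (forgets) the old state."""
    def __init__(self, slots):
        self.slots = [None] * slots
        self.insertions = 0
        self.collisions = 0

    def contains(self, state):
        return self.slots[hash(state) % len(self.slots)] == state

    def insert(self, state):
        i = hash(state) % len(self.slots)
        self.insertions += 1
        if self.slots[i] is not None and self.slots[i] != state:
            self.collisions += 1          # a visited state is forgotten
        self.slots[i] = state

    def collision_rate(self):
        return self.collisions / self.insertions if self.insertions else 0.0

def cbf(init_states, next_states, slots, threshold):
    """Breadth-first visit with a cache table; gives up above the collision threshold."""
    cache, queue, expanded = CacheTable(slots), deque(), 0
    for s in init_states:
        cache.insert(s)
        queue.append(s)
    while queue:
        if cache.collision_rate() > threshold:
            return ("gave_up", expanded)  # prevents endless revisiting
        s = queue.popleft()
        expanded += 1
        for t in next_states(s):
            if not cache.contains(t):
                cache.insert(t)
                queue.append(t)
    return ("completed", expanded)

# Hypothetical toy system: a counter modulo 8 that can advance by 1 or by 3.
def toy_next(s):
    return [(s + 1) % 8, (s + 3) % 8]

big_cache = cbf([0], toy_next, slots=16, threshold=0.1)
small_cache = cbf([0], toy_next, slots=2, threshold=0.1)
```

With 16 slots the eight toy states never collide and the visit completes; with 2 slots the collision rate exceeds the threshold almost immediately and the visit stops.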

Algorithm DBF [2] is a disk based version of CBF. DBF uses a hash table 
M to store signatures (e.g. see [6]) of recently visited states, a file D to store 
signatures of all visited states (old states) and splits the BF queue Q of CBF 
into two queues: Q_ck and Q_unck. DBF uses the checked queue Q_ck to store the 
states in the currently explored BF level and uses the unchecked queue Q_unck 
to store the states that are candidates to be on the next BF level. At the end of 
each BF level, DBF uses file D to remove old states from Q_unck. 

Note that with DBF all data structures that grow with the state space size 
(namely: D, Q_ck, Q_unck) are on disk, thus DBF bottleneck is computation time, 
rather than memory space. 
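
The level-by-level structure of DBF can be sketched as below. This is a rough illustration, not the algorithm of [2]: the set `D` stands in for the disk file, states are used as their own signatures, and `M` is simply cleared at each level, all of which are assumptions made to keep the example short.

```python
def dbf(init_states, next_states):
    """Level-by-level visit with a checked queue and an unchecked queue."""
    D = set()        # stand-in for file D: signatures of all visited (old) states
    q_ck = []        # checked queue: states of the BF level being explored
    for s in init_states:
        if s not in D:
            D.add(s)
            q_ck.append(s)
    levels = 0
    while q_ck:
        M = set()    # RAM table: signatures of recently generated states
        q_unck = []  # unchecked queue: candidates for the next BF level
        for s in q_ck:
            for t in next_states(s):
                if t not in M:
                    M.add(t)
                    q_unck.append(t)
        # End of the BF level: use D to remove old states from the unchecked queue.
        q_ck = [t for t in q_unck if t not in D]
        D.update(q_ck)
        levels += 1
    return D, levels

# Hypothetical toy system: a counter modulo 8 that can advance by 1 or by 3.
def toy_next(s):
    return [(s + 1) % 8, (s + 3) % 8]

visited, levels = dbf([0], toy_next)
```

The check against `D` happens once per level rather than once per generated state, which is what makes it reasonable to keep D on disk.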
Fig. 2. Serialization scheme for CBF, DBF. 



3 Serializing CBF and DBF 

In our context, a serialization scheme is an algorithm that allows us to stop the 
current verification task and to resume it possibly using a different algorithm 
without losing the work previously done. 

Let S be a FSS and Time(A, S) the time needed by algorithm A to complete 
state space exploration of S. A serialization scheme for state space exploration 
algorithms A and B is a state space exploration algorithm [A, B] s.t. there exist 
time instants 0 < t_1 < t_2 < Time([A, B], S) s.t. for all t < t_1, [A, B] behaves as 
A and for all t > t_2, [A, B] behaves as B. 

Of course a serialization scheme for algorithms A and B is interesting only 
if the ratio Time([A, B], S)/min(Time(A, S), Time(B, S)) (serialization ratio) is 
close to 1 for most FSS S. This means that on a single machine we are able to 
run two verification attempts (namely A and B) within the time taken by the 
first terminating verification attempt among the two. 

In this section we present a serialization scheme for the RAM based state 
space exploration algorithm CBF [9] and the disk based state space exploration 
algorithm DBF [2] . 

To switch from CBF to DBF we have to save on disk the current status of 
CBF in such a way that CBF status disk image can then be used to initialize 
DBF data structures. Figure 2 summarizes our serialization scheme. 

CBF status disk image includes the following elements: 

1. A file (queue file in Figure 2) containing BF queue Q of Figure 1. 

2. A file (state space file in Figure 2) containing the visited states (namely, cache 
table T of Figure 1). 

3. A file (administrative file in Figure 2) containing administrative information 
about the verification process. For example, such a file may contain: compression 
options with which CBF has been started (e.g. bit compression [1], hash 
compaction [6]); random seeds used in various hashing functions (e.g. in the 
computation of state signatures [6]); the BF level reached in the BF visit; the 
number of states visited so far; etc. 

In our serialization scheme switching from CBF to DBF is normally requested by 
the serialization controller (Figure 2) when CBF collision rate becomes greater 
than a user given threshold. 

Serialization is requested by sending a signal to (the suitably modified) CBF. 
To keep our serialization scheme simple and efficient, we only allow CBF to 
be stopped when it is easy to dump the current CBF status to disk, namely just 
before a new state is dequeued from the verification queue Q. The CBF queue Q, the 
cache T and the parameters are then saved to disk in the respective files (Figure 2). 

To initialize DBF using the disk image of CBF, first DBF parameters defining 
state format are overridden by CBF parameters saved in the administrative file 
on disk. This is needed to ensure compatibility between the data format saved 
on disk and DBF data format. 

CBF queue Q stored on disk is then loaded and connected to the DBF checked 
queue Q_ck. This is the best choice since Q has already been checked w.r.t. T. DBF 
unchecked queue Q_unck and DBF hash table M are left empty. DBF history file 
D is initialized with the set of visited states in T (Figure 2). After the above steps 
DBF can start normally. 
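
The hand-over from CBF to DBF described above can be sketched as follows. The file names, the JSON encoding, and the parameter names (`state_format`, `hash_compaction`, `io_buffer_kb`) are hypothetical choices made for this illustration; the text does not specify the on-disk format.

```python
import json
import os
import tempfile
from collections import deque

def dump_cbf_status(directory, queue, cache_states, admin):
    """Write the three files of the CBF status disk image (file names are hypothetical)."""
    for name, payload in (("queue.json", list(queue)),
                          ("states.json", list(cache_states)),
                          ("admin.json", admin)):
        with open(os.path.join(directory, name), "w") as f:
            json.dump(payload, f)

def init_dbf_from_image(directory, dbf_params):
    """Build the initial DBF data structures from a CBF status disk image."""
    def load(name):
        with open(os.path.join(directory, name)) as f:
            return json.load(f)
    admin = load("admin.json")
    params = dict(dbf_params)
    params.update(admin["state_format"])    # CBF state format overrides DBF defaults
    return {
        "params": params,
        "bf_level": admin["bf_level"],
        "Q_ck": deque(load("queue.json")),  # Q was already checked w.r.t. T
        "Q_unck": deque(),                  # left empty
        "M": set(),                         # left empty
        "D": set(load("states.json")),      # history file: visited states in T
    }

with tempfile.TemporaryDirectory() as d:
    admin = {"state_format": {"hash_compaction": True}, "bf_level": 2}
    dump_cbf_status(d, deque([5, 7]), [0, 1, 2, 3, 4, 5, 6, 7], admin)
    dbf_state = init_dbf_from_image(d, {"hash_compaction": False, "io_buffer_kb": 64})
```

Overriding the DBF parameters before loading any state data mirrors the ordering in the text: the data format must be agreed on first, then the queue and the history can be reused unchanged.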

4 Experimental Results 

We implemented both algorithms CBF and DBF within the Murφ verifier. This 
was done as illustrated, respectively, in [9] and [2]. The resulting verifiers are 
called, respectively, Cached-Murφ and Disk-Murφ. Thus, not surprisingly, we 
implemented the serialization scheme outlined in Section 3 within the Murφ 
verifier. We call Serial-Murφ the resulting verifier. Unless otherwise stated, in 
this Section CBF denotes Cached-Murφ, DBF denotes Disk-Murφ and [CBF, 
DBF] denotes Serial-Murφ. 

Serial-Murφ first runs Cached-Murφ until it completes the verification or the 
collision rate hits a user given threshold γ (set to 0.1 in our experiments). If the 
collision rate is greater than or equal to γ, Serial-Murφ switches to Disk-Murφ. 

Note that, from [9], we know that Cached-Murφ behaves as standard Murφ 
(both for explored states and verification time) if the collision rate is low. The 
limitation to 10% of collision rate used in our experiments makes Cached-Murφ 
very similar to standard Murφ in terms of performance. 

In this Section we report the experimental results we obtained using [CBF, 
DBF]. Our goal is of course to assess the effectiveness of our serialization scheme. Let 
S be the FSS to be verified. Effectiveness in our case means that the serialization 
ratio (Section 3) satisfies Time([CBF, DBF], S)/min(Time(CBF, S), Time(DBF, S)) 
≈ 1. 

We know [2] that if CBF has enough RAM then Time(CBF, S) < Time(DBF, 
S). In such cases [CBF, DBF] never switches to DBF and thus behaves as CBF. 
Thus in such cases Time([CBF, DBF], S)/min(Time(CBF, S), Time(DBF, S)) 
≈ 1 holds. 
Table 1. Serial-Murφ versus Disk-Murφ. 
Hence the interesting cases for us are those in which CBF does not have 
enough RAM to complete the verification task. In such cases Time(CBF, S) = 
∞, thus min(Time(CBF, S), Time(DBF, S)) = Time(DBF, S). Thus we need 
to check whether Time([CBF, DBF], S)/Time(DBF, S) ≈ 1, which means that 
[CBF, DBF] completes verification taking about the same time as DBF (even 
after trying CBF first). 
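
The serialization ratio, including the case in which one of the base algorithms never terminates, can be written down directly. In this small sketch the function name and the timings are hypothetical, and `math.inf` models Time(CBF, S) = ∞.

```python
import math

def serialization_ratio(t_serial, t_a, t_b):
    """Time([A, B], S) / min(Time(A, S), Time(B, S))."""
    return t_serial / min(t_a, t_b)

# Hypothetical timings in seconds: CBF runs out of RAM, DBF takes 100 s,
# and the serialized algorithm takes 110 s in total.
ratio = serialization_ratio(110.0, math.inf, 100.0)
```

Because min(∞, x) = x, the ratio reduces to Time([CBF, DBF], S)/Time(DBF, S) in exactly the cases discussed above; here it corresponds to a 10% overhead.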

To carry out our experiments we used the benchmark protocols included in 
the Murφ distribution [1] that need at least (about) 100Kb of memory to be 
verified by standard Murφ, and the kerb protocol from [7]. 

First, for each protocol p in our benchmark we determined the minimum 
amount of memory M(p) needed by Murφ (version 3.1 from [1]) to complete the 
verification. Then we compared the performance of Serial-Murφ with that of 
Disk-Murφ for decreasing fractions of such a memory amount. Namely, we ran each 
protocol p with memory limits 0.5M(p), 0.4M(p) and 0.3M(p). 

In this way, we tested our approach under conditions in which Serial-Murφ 
at some point is forced to switch to Disk-Murφ since there is not enough 
RAM for Cached-Murφ to complete its verification task. 

Our results are shown in Table 1, where columns correspond to the memory 
fraction used for the experiment (e.g. column 0.5 corresponds to half of the 
needed memory), and rows report the results obtained for a protocol in terms of 
fired rules, visited states and time to complete the verification. To highlight the 
usefulness of our approach, we report these results as ratios between the values 
obtained by Serial-Murφ and the values obtained using Disk-Murφ on the same 
protocols with the same memory restrictions. Thus row Time in Table 1 gives 
us the serialization ratio. 

The results in Table 1 show that using Serial-Murφ two verification attempts 
(namely: Cached-Murφ and then Disk-Murφ) take about the same time as the 
fastest terminating one (namely Disk-Murφ). Indeed, in Table 1 the Time rows 
range from 1.1 (i.e. a time overhead of 10%, worst case) to 0.8 (i.e. Serial-Murφ 
is 20% faster than Disk-Murφ), averaging to 0.99. 

From Table 1 we see that sometimes Serial-Murφ is faster than Disk-Murφ. 
This is because Serial-Murφ starts verification using a RAM based algorithm 
(CBF) which is faster than the disk based algorithm (DBF) to which Serial-Murφ 
switches only after part of the verification work has been done (in RAM). 

Summing up, the results in Table 1 show that Serial-Murφ is typically as 
fast as (sometimes faster than) the fastest terminating one among Cached-Murφ 
and Disk-Murφ. Thus, using Serial-Murφ we can run two verification attempts 
in the time normally taken by one. 

5 Conclusions 

We presented a verification algorithm that can automatically switch from RAM 
based verification to disk based verification without discarding the work done 
during the RAM based verification phase. 

Our experimental results show that typically our integrated algorithm is as 
fast as (sometimes faster than) the fastest of the two base (i.e. RAM based and 
disk based) verification algorithms. This means that on a single machine we are 
able to run two verification attempts (RAM based and then disk based) within 
the time taken by the first terminating verification attempt. 

Abstract. This paper explores the practicality of describing and verifying both 
the hardware and software components of System-on-Chip (SOC) architectures 
using Esterel. We describe experiments to design and build working hardware 
based around IBM’s CoreConnect™ Intellectual Property (IP) bus. The flow 
we analyse has been used to produce working hardware realized on Xilinx’s 
FPGAs with soft 32-bit processors. Interesting properties about these systems 
have been proved by static analysis based on model checking. 



1 Introduction 

There has been a considerable interest in the ability to map a given function to either 
software or hardware in order to meet constraints for performance, area and time. 
Recently there has also been much interest in the use of assertions in hardware 
description languages to aid the verification process of complex systems. 
Conventional approaches for mapping a function to hardware or software typically 
involve performing totally separate software and hardware implementations (which is 
time consuming) or trying to derive one automatically from the other (which often 
produces poor results especially when one tries to infer efficient hardware from 
sequential software descriptions). The checking of assertions is often performed by a 
dynamic analysis i.e. simulation and this is known to have very poor coverage. This 
paper evaluates an approach which has some promising properties to help solve both 
of these problems. 

We report experiments with the Esterel V7 programming language [7] which we 
have used for the synthesis of both hardware and software. We also report our 
experience of performing static analysis of hardware systems with properties 
expressed graphically as synchronous observers which are checked using an 
embedded model checker. It is our belief that such graphical descriptions of assertions 
may be more accessible to engineers than grammar based approaches. 



2 Design and Property Specification in Esterel 

Previous work in the area of using Esterel for generating efficient protocol code has 
already been reported as part of the HIPPCO project [4][5]. Much has been written 
about using Esterel for synthesizing software (especially C code) [3]. 

An Esterel description can be automatically synthesized either into reactive C 
software or into a variety of hardware description languages (e.g. VHDL and 
Verilog) and then synthesized into hardware. The ability to synthesize either hardware 
or software from the same statically analysable description is one of the most 
appealing aspects of this technology. Although we do not present them here, Esterel 
also contains many other language constructs for concurrent programming. 

Our research investigation seeks to answer the following questions: 

1. Can the intuitive graphical safe state machine notation be used effectively by 
engineers for specifying assertions which can be statically checked? Might this 
notation be more accessible than grammar based approaches? 

2. Can Esterel descriptions be synthesized into efficient hardware and software 
(including a mixture of both) and work seamlessly with conventional vendor tool 
Assertions using Synchronous Observers? 

Emerging techniques for specifying assertions typically involve using an extra 
language which has suitable operators for talking about time and logic relationships 
between signals. These languages are often concrete representations of formal logics 
and assertion languages are really temporal logics which can be statically analysed. 
Can the graphical safe state machine notation provide an alternative way of specifying 
properties about circuits which has the advantage of being cast in the same language 
as the specification notation? And can these circuit properties be statically analysed to 
formally prove properties about circuits? 

To investigate these questions we performed an experiment which involved 
designing a peripheral for IBM’s OPB bus which forms part of IBM’s CoreConnect™ 
IP bus [6]. We chose the OPB bus because it is used by the MicroBlaze soft processor 
which is available on Xilinx’s FPGAs. 

An example of a common transaction on the OPB-bus is shown in Fig. 1. The key 
feature of the protocol that we will verify with an example is that a read or write 
transaction should be acknowledged within 16 clock ticks. Unless a control signal is 
asserted to allow for more time, if a peripheral does not respond within 16 ticks then 
an error occurs on the bus and this can cause the system to crash. Not shown is the 
OPB_RNW signal which determines whether a transaction performs a read or a write. 
Fig. 1. A sample OPB transaction 



We considered the case of a memory mapped OPB slave peripheral which has two 
device registers that a master can write into and a third device register that a master 
can read from. The function performed by the peripheral is to simply add the contents 
of the two ‘write’ registers and make sure that the sum is communicated by the ‘read’ 
register. A safe state machine for such a peripheral is shown in Figure 2. 

The generated VHDL for this peripheral was incorporated into Xilinx’s Embedded 
Developer Kit and it was then used as a building block of a system which also 
included a soft MicroBlaze processor, an OPB system bus and various memory 
resources and interfaces. We wrote test programs to check the operation of the 
peripheral with a 50MHz OPB system bus. The peripheral always produced the 
correct answer. 








Fig. 2. An OPB-slave peripheral 



Having successfully implemented an OPB peripheral from the Esterel specification 
we then attempted to prove an interesting property about this circuit. We chose to try 
and verify the property that this circuit will always emit an OPB transfer acknowledge 
signal two clock ticks after it gets either a read or a write request. If we can statically 
prove this property we know that this peripheral can never be the cause of a transfer 
acknowledge timeout event. 

We expressed this property as a regular Esterel safe state machine as shown in 
Figure 3. This synchronous observer tracks the signal emission behaviour in the 
implementation description and emits a signal if the system enters into a bad state i.e. 
a read or write request is not acknowledged in exactly two clock ticks. 
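
The behaviour of such an observer can be sketched outside Esterel as a small state machine that watches a trace of signals, one entry per clock tick. This Python version is only an illustration: the trace encoding as (request, ack) pairs is hypothetical, and it assumes that new requests are ignored while a transaction is pending and that acknowledges seen while idle are ignored.

```python
def observe(trace):
    """Return, per tick, whether the bad-state signal (XFERACK_MISSING) is emitted.

    trace is a list of (request, ack) boolean pairs, one per clock tick.
    """
    pending = None            # ticks elapsed since the request; None when idle
    errors = []
    for request, ack in trace:
        error = False
        if pending is None:
            if request:
                pending = 0   # start tracking this transaction
        else:
            pending += 1
            if pending < 2:
                if ack:       # acknowledged too early
                    error, pending = True, None
            else:             # exactly two ticks after the request
                if not ack:   # acknowledge is missing
                    error = True
                pending = None
        errors.append(error)
    return errors

good = observe([(True, False), (False, False), (False, True)])
missing = observe([(True, False), (False, False), (False, False)])
early = observe([(True, False), (False, True), (False, False)])
```

Running the observer over finite traces like this corresponds to the simulation-style check; the model checker instead establishes that the error flag can never become true on any trace.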

One way to try and check this property is to try and use it in simulations to see if 
an error case can be found. Esterel Studio supports this by either simulation directly 
within the Esterel framework or by the automatic generation of VHDL 
implementation files and test benches which can check properties specified as 
synchronous observers. 

However, the Esterel Studio system also incorporates a built-in model checker 
(Prover-SL from Prover Technology) which can be used to try and prove such 
properties. We use the latest V7 version of the Esterel language which allows us to 
reason about data as well as control which is an improvement from previous versions 
of the language. We configured the model checker to see if the error signal 
corresponding to a bad state being entered is ever emitted i.e. might the circuit take 
Fig. 3. An assertion expressed as a synchronous observer 



longer than two clock ticks to acknowledge a transfer? It took Esterel Studio less than 
two seconds on a Sun Sparc Ultra-60 workstation to prove that this signal is never 
emitted. 

esverify -v OPB.eid -checkis0 XFERACK_MISSING 
esverify: Reading model from file "OPB.eid". 
esverify: Checking if output "XFERACK_MISSING" is 0 
esverify: Start model-checking properties 
esverify: Verification complete for signal XFERACK_MISSING: 
esverify: -- 
esverify: Model-Checking results summary 
esverify: 
esverify: Status of output "XFERACK_MISSING" : Never emitted. 

We then produced a deliberately broken version of the peripheral which did not 
acknowledge read requests. Within two seconds the software was able to prove that 
there is a case when the acknowledge signal is not asserted after a transaction and 
provided a counter-model and VCD file. 

A conventional approach to catching such bugs involves either 
simulation (which has poor coverage) or the use of bus monitors which snoop the bus 
at execution time looking for protocol violations. A failure to acknowledge a 
transaction is one of the types of bugs that such systems can be configured to catch. 
However, it is far more desirable to catch such problems with a static analysis. We are 
currently trying to convert a list of around 20 such bug checks used in a commercial 
OPB bus monitor into a collection of Esterel synchronous observers to allow us to 
check peripheral protocol conformance with static analyses. 



3 Future Work 

Here we described our experience with just one system and method for realizing 
assertions which can be statically analysed. We have started the process of repeating 
these static property checks using Sugar (with IBM’s FOCs system) and VERA with a 
view to writing a comparative study of the pros and cons of each approach. Sugar- 
based systems could be used indirectly in the Esterel flow by compiling synchronous 
observers into rules (which reside in a separate file from the VHDL design) and then 
using the generated VHDL with vendor software for performing static and dynamic 
analyses of Sugar-based assertions. This would allow Esterel safe charts to act as a 
graphical front end for some types of Sugar assertions. 

For the verification of CoreConnect protocols we are developing Esterel models 
for the PLB (fast complex 64-bit system bus), OPB (a peripheral bus used as a 32-bit 
bus in Xilinx IP and used as the main bus for Xilinx’s soft processor) and DCR (a 
simpler device control register bus) components which can then be used to help write 
synchronous observers for IP blocks without replicating the functionality of the 
system arbiters in each observer. Previous work [8] using the model checker Rulebase 
[1][2] for proving properties about CoreConnect arbiters suggests that this approach 
should be feasible. 

To investigate how feasible it is to make hardware/software trade-offs using this 
flow we are developing implementations for network-on-chip protocols which are 
implemented in a combination of hardware and software. Using this flow we can 
experiment with what portions need to be in hardware for performance and we may 
also be able to perform interesting static analyses of protocol behaviour, performance 
and correctness. 



4 Conclusions 

The approach of using Esterel to produce hardware and software seems to show some 
promise. Initial experiments show that serviceable hardware and software can be 
produced and implemented on real hardware and embedded processors. The 
possibility to enter system specifications graphically makes this method much more 
accessible to regular engineers than competing formalisms which use languages 
which are quite different to what engineers are used to. For any realistic system the 
developer still has to write some portions textually and become aware of the basic 
underlying principles of Esterel. It remains to be seen if the cost of learning this 
formalism is repaid by increased productivity, better static analysis and the ability to 
trade off hardware and software implementations. 

An appealing aspect of this flow is the ability to write assertions in the same 
language as the system specification. This means that engineers do not need to learn 
yet another language and logic. Furthermore, the formal nature of Esterel’s semantics 
may help to make static analysis easier. Our initial experiments with using the 
integrated model checker are certainly encouraging. However, we need to design and 
verify more complex systems before we can come to a definitive conclusion about 
this promising technology for the design and verification of hardware and software 
from a single specification. 

A very useful application of this technology would be to task-based dynamic 
reconfiguration. This method would avoid the need to duplicate implementation effort 
and it would also allow important properties of dynamic reconfiguration to be 
statically analysed to ensure that reconfiguration does not break working circuits. 

There are some limitations to the technique we present here. There are some 
refinements that need to be made to the Esterel language to properly support hardware 
description. Most of these requirements are easily met without upsetting the core 
design of the language. Examples include a much more flexible way of converting 
between integers and bit-vectors and to allow arbitrary precision bit-vectors. 
Currently performing an integer-based address decode for a 64-bit bus is possible in 
Esterel but one has to process the bus in chunks not larger than 31 bits. 

“Virtex-II” is a trademark of Xilinx Inc. “CoreConnect” is a trademark of IBM. 
We would like to thank the staff at Esterel Technologies for their generous assistance 
during this project. 

Abstract. This paper shows how classic inductive assertions can be 
used in conjunction with an operational semantics to prove partial 
correctness properties of programs. The method imposes only the proof 
obligations that would be produced by a verification condition generator but 
does not require the definition of a verification condition generator. The 
paper focuses on iterative programs but recursive programs are briefly 
discussed. Assertions are attached to the program by defining a predicate 
on states. This predicate is then “completed” to an alleged invariant by 
the definition of a partial function defined in terms of the state transi- 
tion function of the operational semantics. If this alleged invariant can be 
proved to be an invariant under the state transition function, it follows 
that the assertions are true every time they are encountered in execution 
and thus that the post-condition is true if reached from a state satisfy- 
ing the pre-condition. But because of the manner in which the alleged 
invariant is defined, the verification conditions are sufficient to prove in- 
variance. Indeed, the “natural” proof generates as subgoals the classical 
verification conditions. The invariant function may be thought of as a 
state-based verification condition generator for the annotated program. 
The method allows standard inductive assertion style proofs to be con- 
structed directly in an operational semantics setting. The technique is 
demonstrated by proving the partial correctness of a simple bytecode 
program with respect to a pre-existing operational model of the Java 
Virtual Machine. 



1 Summary 

This paper connects two well-known approaches to program verification: opera- 
tional semantics and inductive assertions. The paper shows how one can adopt 
the clarity and concreteness of a formal operational semantics while incurring 
just the proof obligations of the inductive assertion method, without writing 
a verification condition generator or other extra-logical tool. In particular, the 
formal definition of the state transition function can be used directly to generate 
verification conditions for annotated programs. 

In this section the idea is presented in the abstract. Some details are skipped 
and a deliberate confusion of states with formulas is perpetrated to convey the 
basic idea. Subsequently, the method is applied to a particular formal operational 
Fig. 1. The One-Loop Program π with Annotations 



semantics, program, annotation, mechanical theorem prover, etc., to demon- 
strate that the basic idea is practical. 

Consider a simple one loop program π (Figure 1) that concludes with a HALT 
instruction. Assume instructions are addressed sequentially, with α being the 
address or label of the first instruction and γ being the address or label of the 
HALT. Let the pre- and post-conditions of the program be P and Q respectively. 
The arrows of Figure 1 indicate the control flow; functions f, g, and h indicate 
the compound state transitions along the arcs and t is the test for staying in the 
loop. R is the loop invariant and “cuts” the only loop. The partial correctness 
challenge is to prove that if P holds at α then Q holds whenever (if) control 
reaches γ. 

To give meaning to such programs with an operational semantics, one for- 
malizes the abstract machine state and the effect of each instruction on the state. 
Typically the state, s, is a vector or n-tuple describing available computational 
resources such as environments, stacks, flags, etc. It is assumed here that the 
state includes a program counter, pc(s), and the current program, prog(s), 
which are used to determine the next instruction. Instructions are given mean- 
ing by defining a state transition function step. Typically, step (s) is defined by 
considering the next instruction and transforming the state components accord- 
ingly. For example, a LOAD instruction might advance the program counter and 
push onto some stack the contents of some specified variable. More complicated 
instructions, such as method invocation, may affect many parts of the state. The 
HALT instruction is particularly simple; it is a no-op. 

It is convenient to define an iterated step function: 

\[
run(k, s) =
\begin{cases}
s & \text{if } k = 0 \\
run(k - 1, step(s)) & \text{otherwise}
\end{cases}
\]

and to make the convention that $s_k = run(k, s)$. 
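
The iterated step function is easy to express in executable form. In this sketch `step` is passed as a parameter and `toy_step` is a hypothetical transition function on (pc, acc) pairs, used only to exercise `run`; neither is taken from the formal model discussed later.

```python
def run(k, s, step):
    """Apply step to the state s exactly k times (k >= 0)."""
    while k > 0:
        s = step(s)
        k -= 1
    return s

# Hypothetical transition function: advance the pc and accumulate its old value.
def toy_step(s):
    pc, acc = s
    return (pc + 1, acc + pc)

s0 = (0, 0)
s3 = run(3, s0, toy_step)   # the state written s_3 under the convention above
```

The loop unfolds the recursive definition directly: each iteration replaces run(k, s) by run(k - 1, step(s)) until k reaches 0.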







Given this operational semantics, the formalization of the partial correctness 
result is 

Theorem: Correctness of Program π. 

$pc(s) = \alpha \land prog(s) = \pi \land P(s) \land pc(s_k) = \gamma \rightarrow Q(s_k)$. 

Proof. In an operational semantics setting, theorems such as the Correctness of 
Program π are proved by establishing an invariant Inv(s) with the following 
three properties: 

1. $Inv(s) \rightarrow Inv(step(s))$, 

2. $pc(s) = \alpha \land prog(s) = \pi \land P(s) \rightarrow Inv(s)$, and 

3. $pc(s) = \gamma \land prog(s) = \pi \land Inv(s) \rightarrow Q(s)$. 

The main theorem is then proved as follows. The inductive application of 
property 1 produces 

4. $Inv(s) \rightarrow Inv(s_k)$. 

Furthermore, instantiation of the s in property 3 with $s_k$ produces 

5. $pc(s_k) = \gamma \land prog(s_k) = \pi \land Inv(s_k) \rightarrow Q(s_k)$. 

We assume no instruction in π changes the program; hence $prog(s) = prog(s_k)$. 
The Correctness of Program π then follows immediately from 2, 4, and 5. □ 

Property 1, above, is problematic; it forces the user of the methodology to 
characterize all the states reachable from the chosen initial state. Contrast this 
situation with that enjoyed by the user of the inductive assertion method, where 
assertions are attached only to certain user-chosen cut-points, as in Figure 1. 
An extra-logical process, which encodes the language semantics as formula 
transformations, is then applied to the annotated program text to generate 
proof obligations or verification conditions 

VC1. $P(s) \rightarrow R(f(s))$, 

VC2. $R(s) \land t \rightarrow R(g(s))$, and 

VC3. $R(s) \land \lnot t \rightarrow Q(h(s))$. 

If these formulas are proved, the user is then assured that if P holds initially 
then Q holds when (if) the program terminates. 

To render this assurance formal, i.e., write it as a formula, one must adopt 
some logic of programs, i.e., a logic that allows the combination of classical 
mathematical expressions about numbers, sequences, vectors, etc., with program 
text and terminology. The resulting programming language semantics is extra- 
logical in the sense that it is expressed as rules of inference in a metalanguage 
and is not directly subject to formal analysis within the logic.¹ In contrast, in the
operational approach, the semantics is expressed within the language (typically
as defined functions or relations on states), programs are objects in the logical
universe, and the properties of both (programs and the semantic functions and
relations) are subject to proof within the logic.

¹ See, however, the discussion of [3] in the next section.

The central question of this paper is whether it is possible to have the best 
of both worlds: the concreteness and clarity of an operational semantics in a 
classical logical setting but the elegance and simplicity of an inductive assertion- 
style proof. The central question may be put bluntly as “Is it possible to prove 
the formula named ‘Correctness of Program $\pi$,’ above, directly from VC1-VC3?”
The answer is “yes.” 

Recall that the proof of ‘Correctness of Program $\pi$’ required the definition
of $Inv(s)$ satisfying properties 1-3 above. The key to constructing an inductive
assertion-style proof in an operational setting is the following definition of
$Inv(s)$.

\[
Inv(s) \equiv
\begin{cases}
prog(s) = \pi \land P(s) & \text{if } pc(s) = \alpha \\
prog(s) = \pi \land R(s) & \text{if } pc(s) = \beta \\
prog(s) = \pi \land Q(s) & \text{if } pc(s) = \gamma \\
Inv(step(s)) & \text{otherwise}
\end{cases}
\]



The logician will immediately ask whether there exists a predicate satisfying 
this equivalence. The affirmative answer is provided in [10]. The logical crux 
of the matter is that Inv (s) is defined with tail-recursion and there exists a 
satisfying and total witness for every tail-recursive equivalence. If some loop in 
the program is not cut, the equivalence may not uniquely define a predicate, but 
at least one witness exists. 
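The shape of this definition can be illustrated with a small Python sketch. Everything in it is an assumption for illustration: a hypothetical six-instruction program with cut points α = 0, β = 2, and γ = 5, states of the form (pc, n, a, n0) in which n0 is a ghost copy of the input, and toy assertions P, R, and Q. The conjunct prog(s) = π is omitted because the program is fixed inside the step function.

```python
# Hypothetical toy program with cut points alpha = 0, beta = 2, gamma = 5.
CUT_POINTS = {0, 2, 5}

def step(s):
    pc, n, a, n0 = s
    if pc == 0:
        return (1, n, 0, n0)                     # a := 0
    if pc == 1:
        return (2, n, a, n0)                     # no-op
    if pc == 2:
        return (3 if n != 0 else 5, n, a, n0)    # test t is "n != 0"
    if pc == 3:
        return (4, n, a + 1, n0)                 # a := a + 1
    if pc == 4:
        return (2, n - 1, a, n0)                 # n := n - 1; goto beta
    return s                                     # pc 5: HALT is a no-op

def assertion(s):
    pc, n, a, n0 = s
    if pc == 0:
        return n == n0 and n >= 0                # P
    if pc == 2:
        return n >= 0 and n + a == n0            # R
    return a == n0                               # Q, asserted at pc 5

def inv(s):
    # Tail recursion written as a loop: step until a cut point is reached.
    # If some loop of the program were not cut, this could run forever.
    while s[0] not in CUT_POINTS:
        s = step(s)
    return assertion(s)

def property_1_holds(states):
    # Inv(s) -> Inv(step(s)), checked by brute force over the given states.
    return all((not inv(s)) or inv(step(s)) for s in states)

sample = [(pc, n, a, n0)
          for pc in range(6) for n in range(-2, 5)
          for a in range(-2, 5) for n0 in range(5)]
```

Evaluating `property_1_holds(sample)` is only a finite test, not a proof, but it shows how a predicate defined this way inherits its value at non-cut points from the next cut point reached.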

Inv (s) clearly has properties 2 and 3. It therefore remains only to prove 
property 1. As will become apparent, the proof that Inv (s) has property 1 will 
generate the verification conditions as subgoals. To drive this home, we describe 
the process by which the proof is constructed rather than merely the formulas 
produced. Recall Figure 1. Successive steps from a state $s$ with pc $\alpha$ eventually
produce the state $f(s)$ with pc $\beta$. Similarly, if $t$, then successive steps from a
state $s$ with pc $\beta$ produce $g(s)$ with pc $\beta$, and if $\neg t$, then successive steps from
a state $s$ with pc $\beta$ produce $h(s)$ with pc $\gamma$. Furthermore, repeated symbolic
expansion and simplification of the step function produce the transformations
described by $f$, $g$, and $h$.

Theorem: Property 1.

\[
Inv(s) \rightarrow Inv(step(s))
\]

Proof. Consider the cases on $pc(s)$ as used in the definition of $Inv$.

Case: $pc(s) = \alpha$. The hypothesis, $Inv(s)$, may be simplified to
$prog(s) = \pi \land P(s)$. Consider the conclusion, $Inv(step(s))$. Symbolic simplification of $step(s)$,
given $pc(s) = \alpha$ and $prog(s) = \pi$, produces a symbolic state $s'$ with
$pc(s') = \alpha + 1$. For program $\pi$, either $\alpha + 1$ is $\beta$ or it is none of the cut points $\alpha$, $\beta$,
or $\gamma$. In the latter case, $Inv(step(s)) = Inv(s') = Inv(step(s'))$ and stepping
continues until $\beta$ is reached at state $f(s)$. Hence, $Inv(step(s)) = R(f(s))$ (since
$prog(f(s)) = \pi$). Thus, this case simplifies to the goal

\[
pc(s) = \alpha \land prog(s) = \pi \land P(s) \rightarrow R(f(s)).
\]

This is just VC1 (with two now-irrelevant hypotheses, given traditional assertions
P and R).

Case: $pc(s) = \beta$. The hypothesis $Inv(s)$ simplifies to $prog(s) = \pi \land R(s)$. Then
the symbolic simplification of $step(s)$ in the conclusion produces a bifurcated
symbolic state whose program counter depends on test $t$. Repeated expansions
of the definition of $Inv$ on both branches of the state eventually reach states $g(s)$
and $h(s)$ at which $Inv$ is defined. The results are VC2 and VC3, respectively.

Case: $pc(s) = \gamma$. The hypothesis $Inv(s)$ simplifies to $prog(s) = \pi \land Q(s)$.
But the $step(s)$ in the conclusion simplifies to $s$ because the instruction at $\gamma$
in $\pi$ is the no-op HALT. Hence, $Inv(s) = Inv(step(s))$ and this case is trivial
(propositionally true independent of the assertions).

Case: otherwise. Since $pc(s)$ is not one of the cut-points, $Inv(s) = Inv(step(s))$
by definition of $Inv$ and this case is also trivial.

□ 

Hence, if the verification conditions VC1-VC3 have been proved, the proof 
of property 1, the step-wise invariance of Inv, involves no assertion-specific rea- 
soning. More interestingly, given the definition of Inv, the proof generates the 
verification conditions by symbolic expansion of the operational semantics’ state 
transition function. 

Practically speaking this means that with a mechanical theorem prover and a 
formal operational semantics one can enjoy the benefits of the inductive assertion 
method without writing a verification condition generator or other extra-logical 
tools to do formula transformations. 

Another practical ramification of this paper is that it provides a simple means 
to define a step-wise invariant given only the assertions at the cut points. Step- 
wise invariants are frequently needed in operational semantics-based proofs of 
safety and liveness properties. 



2 Related Work and Discussion 

McCarthy [11] made explicit the notion of operational semantics, in which “the 
meaning of a program is defined by its effect on the state vector.” 

The inductive assertion method for proving programs correct was implicitly 
used by von Neumann and Goldstine in [4] and made explicit in the classic papers 
by Floyd [2] and Hoare [5]. The first mechanized verification condition generator, 
which generates proof obligations from code and attached assertions, was written 
by King [8]. Hoare, of course, rendered the inductive assertion method formal by 
introducing a logic of programs. From the practical perspective most program 
logics are mechanized with two trusted tools, a formula generator, here called a
VCG, and a theorem prover. It is not uncommon for the VCG to include not just
language semantics as formula transformers but also some logical simplification 
(i.e., theorem proving) to keep the generated proof obligations manageable. 

A notable exception is the work of Gloess [3] where the Hoare semantics of a 
simple imperative programming language is formalized within the higher-order 
logic of PVS and mechanically checked proofs of several programs are carried 
out with PVS. As in the present work, Gloess’ proofs generate the verification 
conditions. The difference however is that the formal semantics is Hoare-style 
rather than operational and is thus designed to generate formulas. 

This paper contains one apparently novel idea: a step-wise invariant can be 
defined from the inductive assertions using the state-transition function. One 
may think of this as a methodology for obtaining a state-based verification condition
generator from an operational semantics. By doing it on a per-program
basis the method avoids the need to generate or trust extra-logical tools.

The use of inductive assertions in conjunction with a formal operational 
semantics to prove partial correctness results mechanically is not new. Robert 
S. Boyer and the author developed it for their Analysis of Programs course at 
the University of Texas at Austin as early as 1983. In that class, an operational 
semantics for a simple procedural language in Nqthm [1] was defined and the 
course explored program correctness proofs that combined operational semantics 
with inductive assertions. These proofs motivated the exploration of total versus 
partial correctness, Hoare logics, and verification condition generation. For an 
Nqthm proof script illustrating the use of inductive assertions in an operational 
semantics setting, see [12]. 

A recent example of the use of assertions to prove theorems about a program 
modeled operationally may be found in [15], where a safety property of a non- 
terminating multi-threaded Java system is proved with respect to an operational 
semantics for the Java Virtual Machine [14]. 

However, in the earlier work the invariant explicitly included an assertion for 
every value of the pc. (The invariant must recognize every reachable state and 
so must handle every pc; the issue is whether it does so explicitly or implicitly.) 

An alternative way to combine inductive assertions at selected cut points
with an operational semantics in a classical formal setting is to formalize and
verify a VCG with respect to the operational semantics. In [6], for example,
an HOL proof of the correctness of a VCG for a simple procedural language is
described. The work includes support for mutually recursive procedures. Formal
proofs of the verification conditions could, in principle, be used with the theorem
stating the correctness of the VCG, to derive a property stated operationally.
But the method described here does not require the definition of a VCG, much
less a proof of its correctness.

Logically speaking, a crucial aspect of the novel idea here is that the step-wise
invariant is defined using tail recursion. The admission of a new function or
predicate symbol via recursive definition is generally handled by a definitional
principle that ensures the existence (and often the uniqueness) of the defined concept.
In many logics, this requires a termination proof. Admitting Inv under such
a definitional principle would require a measure of the distance to the next cut
point and a proof that the distance decreases under step. That imposes a proof
burden not generally incurred by the user of the inductive assertion method.
(Every loop must be cut for the inductive assertion method to be effective; the 
question is whether that must be proved formally or merely demonstrated by 
the successful generation of the verification conditions.) 

The technique used here exploits the observation that Inv is tail-recursive and 
hence admissible without proof obligation, given the work of Manolios and Moore 
[10] in which it was proved that every tail-recursive equation may be witnessed 
by a total function. The tail-recursive function may not be uniquely defined by 
the equation — this might occur if insufficient cut points are chosen. Such a 
failure is manifested by an infinite loop in the process of generating/proving the 
step invariance. This is the same behavior a VCG user would experience in the 
analogous situation. 

The technique here is similar in spirit to one used by Pete Manolios [private
communication] to attack the 2-Job version of the Apprentice problem [15].
There, he defined the reachable states of the Apprentice problem as all the states 
that could be reached from certain states by the execution of a fixed maximum 
number of steps. 

See [13] for a long version of this paper, including all proof scripts. 



3 A Demonstration of the Method 

To illustrate the technique, a mechanized formal logic and an operational semantics
must be introduced. In this paper we use the ACL2 logic [7]. In this logic,
function application is denoted as in Lisp, e.g., run(k, s) is written (run k s).

For the demonstration we choose a pre-existing operational semantics for a 
significant fragment of the JVM [9]. The model is called M5 [14] and it was 
chosen simply because it was available and it was realistic. 

The M5 model is fairly complex, requiring about 250 ACL2 definitions con- 
suming about 3000 lines of formalism on top of ACL2’s extensive support for 
discrete mathematics. In addition to many other JVM data types, M5 supports 
Java’s 32-bit two’s complement integer arithmetic, here called “int arithmetic,”
in which overflow is not signaled; adding one to the most positive int produces 
the most negative int. M5 models 138 bytecode instructions including those for 
the creation and initialization of instance objects in the heap, manipulation of 
static and instance fields, the invocation of static, special, and virtual methods, 
Java’s inheritance rules for method resolution, the creation of multiple threads, 
and synchronization via monitors. The model is operational in the sense that it 
can be executed on the output of Sun’s javac compiler (after transformation of
the class files into ACL2 constants). 
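The wrap-around behaviour of int arithmetic mentioned above can be sketched in Python, whose integers are unbounded. The function name int_fix mirrors the ACL2 function used later in the paper, but this body is an illustrative reimplementation, not the M5 definition.

```python
def int_fix(x: int) -> int:
    """Interpret the low-order 32 bits of x as a two's complement integer."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x >= 0x80000000 else x

MAX_INT = 2**31 - 1      # most positive int
MIN_INT = -2**31         # most negative int

wrapped_up = int_fix(MAX_INT + 1)      # overflow is not signaled
wrapped_down = int_fix(MIN_INT - 2)    # underflow wraps to a positive value
```

Under these definitions, adding one to the most positive int yields the most negative int, and subtracting 2 from the most negative int yields the most positive even int.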

The M5 model of the JVM is a good example of an abstract machine that 
is sufficiently complicated that writing a VCG for it is a serious and error-prone
undertaking.

M5 is formalized by defining step and run functions as above. The state
includes a thread table containing stacks of method invocation frames, a heap, 
and a class table of loaded classes. Each frame contains a pc, bytecoded program, 
local variables, and operand stack. The M5 step function takes two arguments 
instead of just one: (step th s) is the state obtained by stepping thread th in 
state s. The run function, instead of taking the number of steps, takes a list of 
thread identifiers, called a schedule, and steps those threads sequentially. 

Symbolic simplification of this semantics is central to the idea proposed here.
Consider the following bytecode sequence (in the M5 parsed byte-stream format):
(ILOAD_1) (ICONST_1) (IADD) (ISTORE_1). This sequence pushes the value of local
variable 1 on the operand stack, pushes the constant 1, pops the first two
items off the stack and pushes their int sum, and pops the stack into local
variable 1. That is, the sequence corresponds to the Java assignment a = a+1;
if a is allocated in local variable 1. Suppose M5 state s contains a thread, th,
the active frame of thread th has pc 6 and that the bytecode sequence above is
positioned starting at byte offset 6 in the current program. Suppose the locals
of the frame are denoted by locals and the operand stack by stack. The symbolic
simplification of (step th s) produces a symbolic state expression in which
the active frame of thread th has pc 7 and operand stack (push (nth 1 locals)
stack). If three more such steps are taken the result is a symbolic state expression
in which the active frame of thread th has pc 10 and the following expression,
locals′, for its locals: (update-nth 1 (int-fix (+ (nth 1 locals) 1)) locals).
Note that the symbolic expression for local 1 in this environment, (nth 1
locals′), simplifies to (int-fix (+ (nth 1 locals) 1)) using rewrite rules about
nth and update-nth.
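The effect of these four instructions can be replayed concretely in Python. This is a sketch under stated assumptions: it executes on concrete values rather than simplifying symbolic expressions, it models only a single frame (pc, locals, stack) instead of a full M5 state, and it keeps the top of the operand stack at the end of a list.

```python
def int_fix(x):
    """Interpret the low-order 32 bits of x as a two's complement integer."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x >= 0x80000000 else x

# Hypothetical program fragment: byte offset -> opcode.
PROGRAM = {6: "ILOAD_1", 7: "ICONST_1", 8: "IADD", 9: "ISTORE_1"}

def step(frame):
    """Step a frame (pc, locals, stack); the top of the stack is the last item."""
    pc, local_vars, stack = frame
    op = PROGRAM[pc]
    if op == "ILOAD_1":
        return (pc + 1, local_vars, stack + [local_vars[1]])
    if op == "ICONST_1":
        return (pc + 1, local_vars, stack + [1])
    if op == "IADD":
        *rest, x, y = stack
        return (pc + 1, local_vars, rest + [int_fix(x + y)])
    if op == "ISTORE_1":
        *rest, v = stack
        return (pc + 1, [local_vars[0], v] + local_vars[2:], rest)
    raise ValueError(f"unsupported opcode {op}")

def run(k, frame):
    for _ in range(k):
        frame = step(frame)
    return frame

start = (6, [7, 41], [])
after_one = run(1, start)     # pc 7, local 1 pushed on the operand stack
after_four = run(4, start)    # pc 10, local 1 replaced by int_fix(41 + 1)
```

The final value of local 1 is the concrete counterpart of the expression (int-fix (+ (nth 1 locals) 1)).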

4 An Iterative Program 

Below is an M5 program that decrements its first local, informally called n, by 2 
and iterates until the result is 0. On each iteration it adds 1 to its second local 
variable, here called a, which is initialized to 0. Thus, the method computes n/2, 
henceforth written (/ n 2), when n is even. It does not terminate when n is 
odd. 

The program is slightly simpler to deal with if it is assumed that n is a non- 
negative int. The program actually terminates for even negative ints, because 
Java’s int arithmetic wraps around: the most negative int, -2147483648, is even 
and when it is decremented by 2 it becomes the most positive even, 2147483646. 
For simplicity, the program concludes with the fictitious HALT instruction, which 
stops the machine. The program constant below is named *flat-prog* because 
it does not return to a caller but stops the machine. Method invocation is dis- 
cussed later in the paper. 

(defconst *flat-prog*
  '((ICONST_0)     ;  0
    (ISTORE_1)     ;  1  a := 0
    (ILOAD_0)      ;  2  top of loop:
    (IFEQ 14)      ;  3  if n=0, goto 17
    (ILOAD_1)      ;  6
    (ICONST_1)     ;  7
    (IADD)         ;  8
    (ISTORE_1)     ;  9  a := a+1
    (ILOAD_0)      ; 10
    (ICONST_2)     ; 11
    (ISUB)         ; 12
    (ISTORE_0)     ; 13  n := n-2
    (GOTO -12)     ; 14  goto top of loop
    (ILOAD_1)      ; 17  push a
    (HALT)))       ; 18

Let the initial value of n be n0. The goal is to prove that if n0 is a non-negative
int and control reaches pc 18, then n0 is even and (/ n0 2) is on the
stack. That is, if the program halts the initial input must have been even and
the final answer is half that input.

Rather than deal with integer division during the code proof, the following 
function is introduced. The decision to use this function rather than algebraic 
expressions to express the properties of the code is independent of the decision 
to express the properties with inductive assertions. 

(defun halfa (n a)
  (if (zp n)
      a
      (halfa (- n 2) (int-fix (+ a 1)))))

Here, int-fix returns the integer represented by the low-order 32 bits of its argument
and thus implements int wrap-around. The inductive assertion method
will be used to establish that if the program terminates it will leave (halfa n0
0) on the stack. A second theorem, independent of the code, establishes that
(halfa n0 0) is (/ n0 2) under certain conditions. Such decomposition of code
proofs into “algorithm” and “requirements” is standard in the ACL2 community
and independent of whether inductive assertions are being used. It is possible, of
course, to mix the two via inductive assertions about division or multiplication
by two.
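The relationship between halfa and division by two can be explored with a Python transcription. The helper zp is written here from the usual ACL2 reading (true unless its argument is a positive integer), and int_fix is an illustrative reimplementation; both are assumptions of this sketch rather than the ACL2 source.

```python
def int_fix(x):
    """Interpret the low-order 32 bits of x as a two's complement integer."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x >= 0x80000000 else x

def zp(n):
    """Assumed reading of ACL2's zp: true unless n is a positive integer."""
    return not (isinstance(n, int) and n > 0)

def halfa(n, a):
    # Loop form of the recursion (halfa (- n 2) (int-fix (+ a 1))).
    while not zp(n):
        n, a = n - 2, int_fix(a + 1)
    return a

halves_agree = all(halfa(n, 0) == n // 2 for n in range(0, 200, 2))
```

Note that halfa terminates on odd arguments as well, because zp becomes true as soon as n drops below 1. The bytecode program does not behave that way, which is one reason the assertions that follow must track evenness separately.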

5 The Assertions at the Three Cut Points 

The cut points, to which assertions will be attached, are at program counters
0 ($\alpha$), 2 ($\beta$), and 18 ($\gamma$). The assertions themselves, called P, R, and Q in the
earlier treatment, are captured by the following function definitions. The names
of the functions are, of course, irrelevant but indicate how they will be used. In
the earlier treatment it was convenient to make these functions of state; here
they are functions of the initial input n0 and the relevant state components,
namely n and a.

(defun flat-pre-condition (n0 n)
  (and (equal n n0)
       (intp n0)
       (<= 0 n0)))

(defun flat-loop-invariant (n0 n a)
  (and (intp n0)
       (<= 0 n0)
       (intp n)
       (if (and (<= 0 n)
                (evenp n))
           (equal (halfa n a)
                  (halfa n0 0))
           (not (evenp n)))
       (iff (evenp n0) (evenp n))))

(defun flat-post-condition (n0 value)
  (and (evenp n0)
       (equal value (halfa n0 0))))

The details of the assertions are not germane to this paper. The assertions are 
typical inductive assertions for such a program. They are complicated primarily 
because of Java’s int arithmetic. Halfa tracks the behavior of the program only
as long as n stays non-negative. Things would be simpler if the pre-condition
required that n0 be even or if the post-condition did not assert that n0 is even.
These assertions were chosen to illustrate that operational semantics could be 
used to address partial correctness of non-terminating programs including the 
characterization of when termination occurs. 



6 Verification Conditions 

Given *flat-prog*, the informal attachment of the three assertions to the cho- 
sen cut points, and a VCG for the JVM, the following verification conditions 
would be produced. 

(defthm VC1 ; entry to loop
  (implies (flat-pre-condition n0 n)
           (flat-loop-invariant n0 n 0)))

(defthm VC2 ; loop to loop
  (implies (and (flat-loop-invariant n0 n a)
                (not (equal n 0)))
           (flat-loop-invariant n0
                                (int-fix (- n 2))
                                (int-fix (+ 1 a)))))

(defthm VC3 ; loop to exit
  (implies (and (flat-loop-invariant n0 n a)
                (equal n 0))
           (flat-post-condition n0 a)))

These are easily proved. The challenge is: how can these three theorems be
used to verify a partial correctness result for *flat-prog*?
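Before attempting a proof, the three verification conditions can be tested on a finite sample. The Python sketch below transcribes the assertions; the helpers int_fix, intp, evenp, and halfa are illustrative reimplementations, and the sample ranges are arbitrary choices. A passing test is evidence only and does not replace the ACL2 proofs.

```python
def int_fix(x):
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x >= 0x80000000 else x

def intp(x):
    return isinstance(x, int) and -2**31 <= x < 2**31

def evenp(x):
    return x % 2 == 0

def halfa(n, a):
    while isinstance(n, int) and n > 0:       # i.e. (not (zp n))
        n, a = n - 2, int_fix(a + 1)
    return a

def flat_pre_condition(n0, n):
    return n == n0 and intp(n0) and 0 <= n0

def flat_loop_invariant(n0, n, a):
    if not (intp(n0) and 0 <= n0 and intp(n)):
        return False
    if 0 <= n and evenp(n):
        branch = halfa(n, a) == halfa(n0, 0)
    else:
        branch = not evenp(n)
    return branch and evenp(n0) == evenp(n)

def flat_post_condition(n0, value):
    return evenp(n0) and value == halfa(n0, 0)

def vc1(n0, n):
    return (not flat_pre_condition(n0, n)) or flat_loop_invariant(n0, n, 0)

def vc2(n0, n, a):
    hyp = flat_loop_invariant(n0, n, a) and n != 0
    return (not hyp) or flat_loop_invariant(n0, int_fix(n - 2), int_fix(a + 1))

def vc3(n0, n, a):
    hyp = flat_loop_invariant(n0, n, a) and n == 0
    return (not hyp) or flat_post_condition(n0, a)

# Small values plus the two most negative ints, to exercise wrap-around.
N0S = range(0, 12)
NS = list(range(-6, 12)) + [-2**31, -2**31 + 1]
AS = list(range(-2, 8)) + [2**31 - 1]

all_vcs_hold = all(vc1(n0, n) and vc2(n0, n, a) and vc3(n0, n, a)
                   for n0 in N0S for n in NS for a in AS)
```

The sample deliberately avoids large positive even values of n, since this loop-based halfa would take on the order of a billion iterations for them.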



7 Attaching the Assertions to the Code 

In the earlier treatment of the method, the invariant conjoined each assertion
with $prog(s) = \pi$. Here we introduce an intermediate function to do this and
also to name relevant components of the state.

(defun flat-assertion (n0 th s)
  (let ((n (nth 0 (locals (top-frame th s))))
        (a (nth 1 (locals (top-frame th s)))))
    (and (equal (program (top-frame th s)) *flat-prog*)
         (case (pc (top-frame th s))
           (0 (flat-pre-condition n0 n))
           (2 (flat-loop-invariant n0 n a))
           (18 (let ((value (top (stack (top-frame th s)))))
                 (flat-post-condition n0 value)))
           (otherwise nil)))))

The let identifies parts of the JVM state of interest: the 0th local of thread th,
called n, and the 1st local of thread th, called a. It requires that the program
being executed by the thread be *flat-prog* (“$\pi$”). It then case splits on the
pc of thread th and for program counters 0, 2, and 18 makes an assertion about
n, a, and n0. The variable symbol value at the post-condition is bound to the
value on top of the operand stack of the relevant thread at the conclusion of the
program.



8 The Nugget: Defining the Invariant 



The nugget in this paper is how the assertions, attached to selected cut points, 
are completed into a step-wise invariant on states. 

The invariant is introduced with the defpun (“define partial function”) utility 
of [10]. The assertions are tested at the three cut points and all other statements 
inherit the invariant of the next statement. This definition is analogous to that 
for Inv in the abstract treatment, except that the invariant also takes the initial
input, n0, and the identifier of the relevant thread, th.



(defpun flat-inv (n0 th s)
  (if (or (equal (pc (top-frame th s)) 0)
          (equal (pc (top-frame th s)) 2)
          (equal (pc (top-frame th s)) 18))
      (flat-assertion n0 th s)
      (flat-inv n0 th (step th s))))



9 Proofs

Here is the key theorem, called “property 1 of Inv” or the step-wise invariant
theorem.

(defthm flat-inv-step
  (implies (flat-inv n0 th s)
           (flat-inv n0 th (step th s))))

As noted earlier, the proof attempt generates the verification conditions (with
a few extra hypotheses about the program counter and current program). If
ACL2’s database already contains the theorems VC1-VC3, those theorems are
used to complete the proof of flat-inv-step. If the verification conditions have
not already been proved, the proof attempt here generates and proves them.

Central to the process is the symbolic simplification of state expressions under 
the state transition function step. 
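The invariant and its step-wise preservation can be exercised on concrete runs with a small Python model. This is a sketch under several assumptions: the state is a single-threaded triple (pc, locals, stack) rather than an M5 state, the program conjunct is dropped because the program is fixed, the top of the stack is the last tuple element, and the helper functions are illustrative reimplementations. Checking runs tests the property on visited states only; it is not the symbolic proof described above.

```python
def int_fix(x):
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x >= 0x80000000 else x

def intp(x):
    return isinstance(x, int) and -2**31 <= x < 2**31

def evenp(x):
    return x % 2 == 0

def halfa(n, a):
    while isinstance(n, int) and n > 0:       # i.e. (not (zp n))
        n, a = n - 2, int_fix(a + 1)
    return a

# Byte offset -> instruction, transcribed from *flat-prog*.
PROG = {0: ("ICONST_0",), 1: ("ISTORE_1",), 2: ("ILOAD_0",), 3: ("IFEQ", 14),
        6: ("ILOAD_1",), 7: ("ICONST_1",), 8: ("IADD",), 9: ("ISTORE_1",),
        10: ("ILOAD_0",), 11: ("ICONST_2",), 12: ("ISUB",), 13: ("ISTORE_0",),
        14: ("GOTO", -12), 17: ("ILOAD_1",), 18: ("HALT",)}

def step(s):
    """One step of a state (pc, locals, stack); top of stack is last."""
    pc, loc, stk = s
    op, *arg = PROG[pc]
    if op.startswith("ICONST_"):
        return (pc + 1, loc, stk + (int(op[-1]),))
    if op.startswith("ILOAD_"):
        return (pc + 1, loc, stk + (loc[int(op[-1])],))
    if op.startswith("ISTORE_"):
        i = int(op[-1])
        return (pc + 1, loc[:i] + (stk[-1],) + loc[i + 1:], stk[:-1])
    if op in ("IADD", "ISUB"):
        x, y = stk[-2], stk[-1]
        return (pc + 1, loc, stk[:-2] + (int_fix(x + y if op == "IADD" else x - y),))
    if op == "IFEQ":                           # a 3-byte instruction
        return (pc + arg[0] if stk[-1] == 0 else pc + 3, loc, stk[:-1])
    if op == "GOTO":
        return (pc + arg[0], loc, stk)
    return s                                   # HALT is a no-op

def flat_assertion(n0, s):
    pc, (n, a), stk = s
    if pc == 0:
        return n == n0 and intp(n0) and 0 <= n0
    if pc == 2:
        if not (intp(n0) and 0 <= n0 and intp(n)):
            return False
        ok = halfa(n, a) == halfa(n0, 0) if (0 <= n and evenp(n)) else not evenp(n)
        return ok and evenp(n0) == evenp(n)
    if pc == 18:
        return evenp(n0) and stk[-1] == halfa(n0, 0)
    return False

def flat_inv(n0, s):
    while s[0] not in (0, 2, 18):              # loop form of the tail recursion
        s = step(s)
    return flat_assertion(n0, s)

def check_run(n0, steps=200):
    """Follow one run from pc 0, checking flat-inv-step at every visited state."""
    s = (0, (n0, 123), ())
    assert flat_inv(n0, s)
    for _ in range(steps):
        t = step(s)
        assert (not flat_inv(n0, s)) or flat_inv(n0, t)
        s = t
    return s
```

For small even inputs, 200 steps are enough to reach the HALT at pc 18 with half the input on the stack. For odd inputs the run stays in the loop, yet the invariant continues to hold at each step.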

Having proved the invariance of flat-inv under step, the next theorem in
the mechanized “methodology,” which corresponds to property 4 of the earlier proof of
the Correctness of Program $\pi$, is trivial. The theorem states that flat-inv is
invariant under arbitrarily long runs of the thread in question.

(defthm flat-inv-run
  (implies (and (mono-threadedp th sched)
                (flat-inv n0 th s))
           (flat-inv n0 th (run sched s))))

where

(defun mono-threadedp (th sched)
  (if (endp sched)
      t
      (and (equal th (car sched))
           (mono-threadedp th (cdr sched)))))

Proof of flat-inv-run is trivial by induction and appeal to flat-inv-step.
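The role of the mono-threadedp hypothesis can be seen in a toy Python model. The shared-counter machine below is hypothetical and unrelated to M5: thread 0 adds 2 to a counter, any other thread adds 1, and the invariant says the counter is even. Steps of thread 0 preserve the invariant, so a mono-threaded schedule does too, whereas a mixed schedule may not.

```python
def mono_threadedp(th, sched):
    """True when every entry of the schedule is thread th."""
    return all(t == th for t in sched)

def run(sched, s, step):
    """Step the scheduled threads one after another."""
    for th in sched:
        s = step(th, s)
    return s

# Hypothetical shared-counter machine: thread 0 adds 2, other threads add 1.
def toy_step(th, x):
    return x + 2 if th == 0 else x + 1

def toy_inv(x):
    return x % 2 == 0

kept = toy_inv(run([0] * 5, 4, toy_step))        # mono-threaded schedule
broken = toy_inv(run([0, 1, 0], 4, toy_step))    # thread 1 interferes
```

The induction behind flat-inv-run has the same shape: the step lemma applies once per schedule entry, which is possible only when each entry is the thread the lemma talks about.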

Thus, if the initial state has pc 0 and satisfies the pre-condition, and, after
some arbitrary mono-threaded run, a state with pc 18 is reached, then it satisfies
the post-condition, namely, n0 is even and the answer is (halfa n0 0). Formally
this can be written as follows.

(defthm flat-main
  (let ((s1 (run sched s0)))
    (implies (and (intp n0)
                  (<= 0 n0)
                  (equal (pc (top-frame th s0)) 0)
                  (equal (locals (top-frame th s0)) (list n0 any))
                  (equal (program (top-frame th s0)) *flat-prog*)
                  (mono-threadedp th sched)
                  (equal (pc (top-frame th s1)) 18))
             (and (evenp n0)
                  (equal (top (stack (top-frame th s1)))
                         (halfa n0 0))))))

This is proved by using the instance of flat-inv-run obtained by letting s
be s0.

Flat-main is essentially the goal, except it characterizes the answer as
(halfa n0 0). If (/ n0 2) were preferred, either a separate proof relating
(halfa n0 0) to (/ n0 2) could be performed, or the assertions could be stated
in terms of division in the first place. In any case, this issue is independent of
the use of inductive assertions.

It takes ACL2 approximately 8 seconds (on a 797MHz Pentium III) to prove
flat-inv-step, in which the verification conditions are generated by repeated
symbolic expansion of step on the bytecode in *flat-prog*. The subsequent
proofs of flat-inv-run and flat-main take less than 1.5 seconds in all. The
only proof-specific lemmas developed for this exercise were mathematical lemmas
on the properties of evenp in int arithmetic when subtracting 2.

Notice what has been accomplished. Flat-main is a partial correctness theorem
about a JVM program, formalized with an operational semantics. The
creative part of the proof consisted of the definition of the three assertions.
Users familiar with inductive assertions would find these assertions straightforward
(requiring only a few minutes to write down). The proof of the key lemma,
flat-inv-step, generated (and required the proof of) the classic verification
conditions just as though a VCG for the JVM were available. But no VCG was
defined. The proof does not establish termination of the code under the pre-conditions
but does characterize necessary conditions to reach the HALT statement.
Finally, neither the theorem nor the proof involved counting instructions
or defining what is called a “clock function” in the Boyer-Moore community.

10 Method Invocation and Return 

The HALT instruction in the previous program is fictitious but handy. Stepping 
the machine while on a HALT leaves the machine at the HALT. Thus, the invariance 
of the exit assertion is easy to prove once the exit is reached. In realistic code, the 
machine does not halt but returns control to the caller and non-trivial stepping 
continues. A useful inductive assertion methodology must deal with call and 
return. This paper does not discuss call and return in detail; see [13]. 

On the JVM, method invocation pushes a new stack frame on the invocation 
stack of the active thread. Abstractly, that frame may be thought of as contain- 
ing the bytecode for the newly invoked method with initial pc 0. The new frame 
contains an initially empty “operand stack” for intermediate results. When cer- 
tain return instructions are executed, the topmost item, v, on the operand stack 
is removed, the invocation stack is popped, and v is pushed onto the operand 
stack of the caller.²

² Some forms of return implement void methods and return no v to the caller.

To deal with call and return via inductive assertions, two changes are made to
the “methodology” described above. First, instead of using run to run the state 
a certain number of steps, the new function run-to-return is introduced, which 
runs a certain number of steps or until the state returns from the call depth, dO, 
at which the run was started. Second, the assertion function is changed so that 
the post-condition is asserted if the call depth is less than dO. 
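The paper defers the details of run-to-return to [13], so the following Python sketch is only one plausible reading of the description above. The toy machine, in which a state is a tuple of frame counters and the call depth is the number of frames, is entirely hypothetical.

```python
def run_to_return(k, d0, s, step, depth):
    """Run at most k steps, stopping once the call depth drops below d0."""
    while k > 0 and depth(s) >= d0:
        s = step(s)
        k -= 1
    return s

# Hypothetical toy machine: a state is a tuple of frames, each a counter.
def toy_step(s):
    *below, top = s
    if top == 0:
        return tuple(below)          # return: pop the active frame
    if top == 2:
        return (*below, 1, 1)        # call: push a new frame
    return (*below, top - 1)         # ordinary instruction

start = (9, 2)                       # caller frame 9, callee frame 2, depth 2
returned = run_to_return(100, 2, start, toy_step, len)
partial = run_to_return(2, 2, start, toy_step, len)
```

In this reading the nested call made by the callee does not stop the run, because the depth stays at or above d0 until the callee itself returns. That is the point at which the modified assertion function would assert the post-condition.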

To deal with recursive methods, one must characterize the stack of frames 
created by previous recursive calls so that returns produce states in which 
continued symbolic evaluation is possible. 

It should be possible to use this technique to express safety and liveness
invariants for multi-threaded programs, significantly reducing the amount of
definitional work done in examples such as [15], but that experiment has not been
done yet.

11 Conclusion 

This paper has demonstrated that inductive assertion style proofs can be carried 
out in an operational semantics framework, without producing a verification con- 
dition generator or incurring proof obligations beyond those produced by such a 
tool. The key insight is that assertions attached to cut points in a program can 
be propagated by a tail-recursive function to create an alleged invariant. The 
proof that the alleged invariant is invariant under the state transition function 
produces the standard verification conditions. The invariance result can then 
be traded in for a partial correctness result stated in terms of the operational 
semantics, without requiring the construction of clocks or the counting of in- 
structions. 

No verification condition generator need be constructed. Given an operational 
semantics it is possible, more or less immediately, to perform inductive assertion 
style proofs of partial correctness theorems. 

The process of proving the step-wise invariance of the completed assertions 
“naturally” produces the verification conditions. 

This situation is attractive for three reasons. First, writing a verification 
condition generator for a realistic programming language like JVM bytecode is 
error-prone. For example, method invocation involves complicated non-syntactic 
issues like method resolution with respect to the object on which the method is 
invoked, as well as side-effects to many parts of the state including, possibly, the 
call frames of both the caller and the callee, the thread table (in the event that a 
thread is started), the heap (in the event of a synchronized method locking the 
object upon which it is invoked), and the class table (in the event of dynamic 
class loading). Coding this all in terms of formula transformation instead of 
state transformation is difficult. Second, when completed, the semantics of the 
language is encoded in the VCG process rather than as sentences in a logic. 
This encoding of the semantics makes it difficult to inspect. In our approach, 
the semantics is expressed explicitly in the logic so that it can be inspected. 
Indeed, it is possible to prove theorems about the semantics (not just theorems
about programs under the semantics). Finally, realistic VCGs contain simplifiers
used to keep the generated proof obligations simple. These simplifiers are just
theorem provers and must be trusted. In our approach, only one theorem prover
is involved. It must be trusted but that trusted engine derives the verification
conditions from the operational semantics and the user-supplied assertions.

Abstract. I develop a compositional theory of refinement for the branching
time framework based on stuttering simulation and prove that if one
system refines another, then a refinement map always exists. The existence
of refinement maps in the linear time framework was studied
in an influential paper by Abadi and Lamport. My interest in proving
analogous results for the branching time framework arises from the observation
that in the context of mechanical verification, branching time
has some important advantages. By setting up the refinement problem
in a way that differs from the Abadi and Lamport approach, I obtain a
proof of the existence of refinement maps (in the branching time framework)
that does not depend on any of the conditions found in the work of
Abadi and Lamport, e.g., machine closure, finite invisible nondeterminism,
internal continuity, the use of history and prophecy variables, etc.

A direct consequence is that refinement maps always exist in the linear
time framework, subject only to the use of prophecy-lite variables.



1 Introduction 

Computing systems are ubiquitous, controlling everything from cars and airplanes
to financial markets and the distribution of information. Such systems
tend to be very complicated and often contain costly errors. One approach to
dealing with this complexity is to specify a sequence of related systems, starting
with an abstract system, the specification, and ending with a concrete system,
the implementation. One then proves that every pair of adjacent systems is related,
via a suitable, compositional notion of correctness, thereby establishing
that the specification is correctly implemented. For example, we can imagine verifying
a netlist description of a pipelined microprocessor, the implementation, by
relating it via a sequence of refinements to an instruction set level specification,
that is, the assembly programmer’s view of the processor.

Two important concepts that notions of correctness must account for are: 

— Stuttering. Since the sp ecification is defined at a more abstract lev el than 
the implementation, notions of correctness should allow for stuttering^ here 
the implementation may require several steps before matching a single step 
of the sp ecification [14 ]. 
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— Refinement. The implementation may contain more state components and 
may use different data representations than the specification. Refinement 
maps are used to show how to view an implementation state as a specification 
state [1]. 

The classic paper on the topic by Abadi and Lamport [1], which has mo- 
tivated the work appearing in this paper, contains an in-depth discussion of 
these topics. The main idea is to use refinement maps to prove that systems 
have related infinite computations, by reasoning locally, about states and their 
successors, instead of globally, about infinite paths. Abadi and Lamport prove a 
theorem about when such refinement maps exist in the linear time framework, 
where the semantics of systems and properties correspond to sets of infinite 
sequences. 

My approach differs in that I work in the branching time framework, where 
the semantics of systems are given by sets of infinite trees. Even so, the results 
can be applied to the linear time framework, as I explain later. 

The theorem proved by Abadi and Lamport holds only under certain con- 
ditions. Briefly, they allow one to add history and prophecy variables to the 
implementation, they require that the implementation is machine closed, and 
they require that the specification has finite invisible nondeterminism and is in- 
ternally continuous. My theorems do not depend on these conditions, but there 
are important differences between the two approaches that are explored in depth 
later. 

There are two main reasons why I chose to work in the branching-time frame- 
work. The first is that in the simple case where one is dealing with finite-state 
systems, it makes sense to use algorithms that can check if one finite-state system 
refines another. For example, in [17] we use algorithms for deciding stuttering 
bisimulation to complete a proof of correctness for the alternating bit protocol 
(this is an infinite-state problem that was reduced to a finite-state problem using 
a theorem prover). The branching time notions of simulation and bisimulation, 
due to Milner and Park [18, 21], can be decided in polynomial time [20, 7]. 
In contrast, the corresponding linear time notions, trace equivalence and trace 
containment, are both PSPACE-complete problems [26]. 
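To make the polynomial-time claim concrete, the following is a minimal sketch of a naive fixpoint computation of the greatest (ordinary, non-stuttering) simulation on a finite labeled transition system. It is not the algorithm of [20, 7]; the function name, the dictionary encoding, and the example system are assumptions made purely for illustration.

```python
def greatest_simulation(states, succ, label):
    """Return the greatest simulation as a set of pairs (s, w): w simulates s.

    states: iterable of hashable states
    succ:   dict mapping each state to the set of its successors
    label:  dict mapping each state to its observable label
    """
    # Start from all identically labeled pairs and remove violators.
    rel = {(s, w) for s in states for w in states if label[s] == label[w]}
    changed = True
    while changed:
        changed = False
        for (s, w) in list(rel):
            # Every move s -> u must be answered by some w -> v with (u, v) in rel.
            ok = all(any((u, v) in rel for v in succ[w]) for u in succ[s])
            if not ok:
                rel.discard((s, w))
                changed = True
    return rel


# Hypothetical system: "a" may move to "b" or "c"; "p" may only move to "q".
STATES = ["a", "b", "c", "p", "q"]
SUCC = {"a": {"b", "c"}, "b": {"b"}, "c": {"c"}, "p": {"q"}, "q": {"q"}}
LABEL = {"a": 0, "b": 1, "c": 2, "p": 0, "q": 1}
SIM = greatest_simulation(STATES, SUCC, LABEL)
```

Each pass inspects every remaining pair once and at least one pair is removed per pass, so the loop runs a polynomial number of steps in the size of the system. In the example, "a" simulates "p" but not the other way around, because "p" has no answer to the move from "a" to "c".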

Second, refinement maps allow one to show that one system simulates an- 
other. This is inherently a branching time notion which has the advantage of 
being structural and local. However, in order to use refinement maps in a lin- 
ear time setting other mechanisms are needed to, in essence, hide the branching 
structure of systems. Thus, we expect the branching time case to be simpler than 
the linear time case. Obvious questions arise. How much simpler? What condi- 
tions in the Abadi and Lamport theorem are there for this purpose? It turns out 
that by using only prophecy-like variables, which have the effect of destroying 
the branching structure of systems, we can get a completeness theorem for the
linear time case.

Stuttering simulation is based on the notions of simulation and bisimula- 
tion, which have had a deep impact on how we think about specifications. The 
literature on this topic is vast and contains many fine surveys [23, 15, 6]. In
addition, there have been various extensions of the Abadi and Lamport result [1],
including [5, 9, 2, 8]. In related previous work, Namjoshi [19] gives a sound
and complete proof rule for symmetric stuttering bisimulations which has
heavily influenced my work; however, Namjoshi does not consider simulations and
does not deal with refinement. Stuttering bisimulations and the related notion
of WEBs (Well-founded Equivalence Bisimulations) were used to link theorem
proving and model checking and to mechanically verify the alternating bit
protocol in [17]. In [16], I proposed a notion of correctness for pipelined machines
based on WEBs and I showed that the variant of the Burch and Dill notion of
correctness [4] in [24, 25] can be satisfied by machines that deadlock. In addition,
I used the ACL2 theorem prover [12, 11, 10] to automate much of the
verification. I also verified variants of the pipelined machine including machines with
exceptions, interrupts (which lead to non-determinism), and netlist (gate-level)
descriptions and showed that my notion of correctness applies to these
extensions. Many of the variant machines were verified in stages, using the WEB
compositional proof rule. Unfortunately, stuttering bisimulation and WEBs are
often too strong a notion, just as trace equivalence is often too strong a notion in
the linear time case. I expect stuttering simulation to be much more applicable,
hence my interest in the topic.

The paper is organized as follows. In section 2, I describe my notational
conventions and review background material. In section 3, I develop a theory of
refinement based on stuttering simulation. In section 4, I discuss refinement in
the linear time framework and compare my work with that of Abadi and
Lamport; some readers may want to start by skimming this section first. I conclude
in section 5.



2 Notation and Mathematical Preliminaries 

$\mathbb{N}$ and $\omega$ both denote the natural numbers, i.e., $\{0, 1, \ldots\}$. The ordered pair
whose first component is $i$ and whose second component is $j$ is denoted
$\langle i, j \rangle$. $[i..j]$ denotes the closed interval $\{k \in \mathbb{N} : i \le k \le j\}$;
parentheses are used to denote open and half-open intervals, e.g., $[i..j)$ denotes
the set $\{k \in \mathbb{N} : i \le k < j\}$. The disjoint union operator is denoted by $\uplus$.
Cardinality of a set $S$ is denoted by $|S|$. $\mathcal{P}(S)$ denotes the powerset of $S$.
Function application is sometimes denoted by an infix dot ".". For any binary
relation $R$: I abbreviate $(s, w) \in R$ by $sRw$, I write $R(S)$ for the image of $S$
under $R$ (i.e., $R(S) = \{y : (\exists x : x \in S : xRy)\}$), and $R|_A$ denotes $R$
left-restricted to the set $A$ (i.e., $R|_A = \{(a, b) : (aRb) \wedge (a \in A)\}$).
The composition of binary relations $R$ and $T$ is denoted $R;T$ or $T \circ R$, i.e.,
$R;T = T \circ R = \{(r, t) : (\exists x :: rRx \wedge xTt)\}$. The inverse of binary relation
$R$ is denoted $R^{-1}$ and is defined to be $\{(a, b) : bRa\}$.
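For finite relations, the operations just introduced can be written down directly. The sketch below represents a binary relation as a Python set of pairs; the function names and the sample relations are illustrative assumptions, not notation from the paper.

```python
def image(R, S):
    """R(S) = {y : there is x in S with x R y}."""
    return {y for (x, y) in R if x in S}


def left_restrict(R, A):
    """R|_A = {(a, b) : a R b and a in A}."""
    return {(a, b) for (a, b) in R if a in A}


def compose(R, T):
    """R;T = T o R = {(r, t) : there is x with r R x and x T t}."""
    return {(r, t) for (r, x) in R for (x2, t) in T if x == x2}


def inverse(R):
    """R^{-1} = {(a, b) : b R a}."""
    return {(b, a) for (a, b) in R}


# Hypothetical sample relations on small integers.
R = {(1, 2), (2, 3)}
T = {(2, 5), (3, 7)}
```

Note that composition is written in diagrammatic order: `compose(R, T)` applies `R` first and `T` second, matching $R;T$.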

$(Qx : r : b)$ denotes a quantified expression, where $Q$ is the quantifier, $x$ the
bound variable, $r$ the range of $x$ (true if omitted), and $b$ the body. I sometimes
write $(Qx \in X : r : b)$ as an abbreviation for $(Qx : x \in X \wedge r : b)$,
where $r$ is true if omitted, as before. From highest to lowest binding power, we
have: parentheses, function application, binary relations (e.g., $sBw$), equality
($=$) and membership ($\in$), conjunction ($\wedge$) and disjunction ($\vee$), implication
($\Rightarrow$), and finally, binary equivalence ($\equiv$).

Spacing is used to reinforce binding: more space indicates lower binding. 

A binary relation, $B \subseteq X \times X$, is reflexive if $(\forall x \in X :: xBx)$. $B$ is symmetric
if $(\forall x, y \in X :: xBy \Rightarrow yBx)$. $B$ is antisymmetric if
$(\forall x, y \in X :: xBy \wedge yBx \Rightarrow x = y)$. $B$ is transitive if
$(\forall x, y, z \in X :: xBy \wedge yBz \Rightarrow xBz)$.
A binary relation is a preorder if it is reflexive and transitive. A preorder that
is also symmetric is an equivalence relation.
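These properties are easy to check mechanically on finite relations. The following sketch, with assumed helper names and a relation encoded as a set of pairs over a finite carrier set X, tests each definition in turn.

```python
def is_reflexive(B, X):
    return all((x, x) in B for x in X)


def is_symmetric(B):
    return all((y, x) in B for (x, y) in B)


def is_antisymmetric(B):
    return all(x == y for (x, y) in B if (y, x) in B)


def is_transitive(B):
    return all((x, z) in B for (x, y) in B for (y2, z) in B if y == y2)


def is_preorder(B, X):
    return is_reflexive(B, X) and is_transitive(B)


def is_equivalence(B, X):
    return is_preorder(B, X) and is_symmetric(B)


# Hypothetical examples on X = {0, 1, 2}.
X = {0, 1, 2}
LEQ = {(x, y) for x in X for y in X if x <= y}
SAME_PARITY = {(x, y) for x in X for y in X if x % 2 == y % 2}
```

Here `LEQ` is a preorder that is antisymmetric but not symmetric, while `SAME_PARITY` is an equivalence relation.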

A finite sequence is a function from $[0..n)$ for some natural number $n$. An
infinite sequence is a function from $\mathbb{N}$. When I write $x \in \sigma$, for a sequence $\sigma$, I
mean that $x$ is in the range of $\sigma$. A well-founded structure is a pair $(W, \prec)$ where
$W$ is a set and $\prec$ is a binary relation on $W$ such that there are no infinitely
decreasing sequences on $W$, with respect to $\prec$. I use $<$ to compare natural
numbers and $\prec$ to compare ordinal numbers.

A transition system (TS) is a structure $(S, \to, L)$, where $S$ is a set of states,
$\to \subseteq S \times S$ is the transition relation, and $L$ is the labeling function: its domain is $S$
and it tells us what is observable at a state. I also require that $\to$ is left-total:
for every $s \in S$, there is some $u \in S$ such that $s \to u$. Notice that a transition
system is a labeled graph where the nodes are states and are labeled by $L$.
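A finite transition system in this sense can be modeled with a small data structure. The class below is only a sketch: the field and method names are assumptions, and the two sample systems are made up to show the left-totality requirement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransitionSystem:
    states: frozenset   # the set S
    trans: frozenset    # pairs (s, u), read as s -> u
    label: dict         # maps each state to what is observable there

    def successors(self, s):
        return {u for (t, u) in self.trans if t == s}

    def is_left_total(self):
        # Every state must have at least one successor.
        return all(self.successors(s) for s in self.states)

    def is_well_formed(self):
        closed = all(s in self.states and u in self.states
                     for (s, u) in self.trans)
        labeled = all(s in self.label for s in self.states)
        return closed and labeled and self.is_left_total()


# A one-bit clock, and a hypothetical system in which state 1 is stuck.
CLOCK = TransitionSystem(frozenset({0, 1}), frozenset({(0, 1), (1, 0)}),
                         {0: 0, 1: 1})
STUCK = TransitionSystem(frozenset({0, 1}), frozenset({(0, 1)}),
                         {0: 0, 1: 1})
```

`STUCK` is rejected because state 1 has no outgoing transition; a common way to repair such a system is to add a self-loop at the stuck state.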

A path $\sigma$ is a sequence of states such that for adjacent states $s$ and $u$, $s \to u$.
A path, $\sigma$, is a fullpath if it is infinite. $\mathit{fp}.\sigma.s$ denotes that $\sigma$ is a fullpath starting
at state $s$ and $\sigma^i$ denotes the suffix fullpath $(\sigma.i, \sigma(i+1), \ldots)$. I use the symbol

for concatenation of paths where the left path is finite, e.g., a; ab = aab. 

Temporal logic was proposed as a formalism for specifying the correctness of 
computing systems in a landmark paper by Pnueli [22]. I assume that the reader 
is familiar with temporal logic. 

3 Stuttering Simulation Refinement 

Stuttering simulation depends on the notion of matching I now define. I start
with an informal account. Given a relation $B$ on a set $S$, we say that an infinite
sequence $\sigma$ (of elements from $S$) matches an infinite sequence $\delta$ (of elements
from $S$) if the sequences can be partitioned into non-empty, finite segments such
that elements in related segments are related by $B$. For example, if the first
segment of $\sigma$ has three elements and the first segment of $\delta$ has seven elements,
then each of the three elements is related by $B$ to each of the seven elements. I
use matching, where the infinite sequences are fullpaths of a transition system,
to define stuttering simulation.

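Definition 1. (match) Let $i$ range over $\mathbb{N}$. Let INC be the set of strictly
increasing sequences of natural numbers starting at 0; formally,
$\mathrm{INC} = \{\pi : \pi : \mathbb{N} \to \mathbb{N} \wedge \pi.0 = 0 \wedge (\forall i \in \mathbb{N} :: \pi.i < \pi(i+1))\}$.
The $i$th segment of an infinite sequence $\sigma$ with respect to $\pi \in \mathrm{INC}$,
${}^{\pi}\sigma^{i}$, is given by the sequence $(\sigma(\pi.i), \ldots, \sigma(\pi(i+1) - 1))$.

For $B \subseteq S \times S$, $\pi, \xi \in \mathrm{INC}$, $i, j \in \mathbb{N}$, and infinite sequences $\sigma$ and $\delta$, I
abbreviate $(\forall s, w : s \in {}^{\pi}\sigma^{i} \wedge w \in {}^{\xi}\delta^{j} : sBw)$ by $({}^{\pi}\sigma^{i}) B ({}^{\xi}\delta^{j})$.

In addition: $\mathit{corr}(B, \sigma, \pi, \delta, \xi) \equiv (\forall i \in \mathbb{N} :: ({}^{\pi}\sigma^{i}) B ({}^{\xi}\delta^{i}))$ and
$\mathit{match}(B, \sigma, \delta) \equiv (\exists \pi, \xi \in \mathrm{INC} :: \mathit{corr}(B, \sigma, \pi, \delta, \xi))$.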

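Lemma 1. Given set $S$, $B \subseteq S \times S$, and infinite sequences $\sigma$ and $\delta$,

$(\exists \pi, \xi \in \mathrm{INC} :: \mathit{corr}(B, \sigma, \pi, \delta, \xi))$
$\equiv$
$(\exists \pi', \xi' \in \mathrm{INC} :: \mathit{corr}(B, \sigma, \pi', \delta, \xi') \wedge (\forall i \in \mathbb{N} :: |{}^{\pi'}\sigma^{i}| = 1 \vee |{}^{\xi'}\delta^{i}| = 1))$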

The above lemma allows us to reason about segments using case analysis, 
where the three cases are: both segments are of length 1, the right segment is of 
length 1 and the left of length greater than 1, and the left segment is of length 
1 and the right of length greater than 1. 
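Matching is defined on infinite sequences, but the segment structure can be illustrated on finite prefixes. The sketch below is an assumption-laden illustration: sequences and partitions are finite Python lists standing in for prefixes of the infinite objects, and the helper names are not from the paper. It reproduces the informal example of a three-element segment matched against a seven-element segment.

```python
def segment(seq, pi, i):
    """i-th segment of seq w.r.t. pi: seq[pi[i]], ..., seq[pi[i+1] - 1]."""
    return seq[pi[i]:pi[i + 1]]


def is_inc_prefix(pi):
    """pi starts at 0 and is strictly increasing (a finite prefix of an INC element)."""
    return len(pi) > 0 and pi[0] == 0 and all(a < b for a, b in zip(pi, pi[1:]))


def corr_prefix(B, sigma, pi, delta, xi, n):
    """Check that the first n segments of sigma and delta are pairwise related by B."""
    assert is_inc_prefix(pi) and is_inc_prefix(xi)
    for i in range(n):
        left, right = segment(sigma, pi, i), segment(delta, xi, i)
        if not all((s, w) in B for s in left for w in right):
            return False
    return True


# Hypothetical data: "a" is related to "x" and "b" is related to "y".
B = {("a", "x"), ("b", "y")}
SIGMA = ["a", "a", "a", "b"]
DELTA = ["x"] * 7 + ["y", "y"]
PI = [0, 3, 4]        # segments of SIGMA: aaa | b
XI = [0, 7, 9]        # segments of DELTA: xxxxxxx | yy
XI_BAD = [0, 6, 9]    # second segment would start with an "x"
```

With `PI` and `XI` the first two segments correspond; with `XI_BAD` the second pair of segments fails because "b" is not related to "x".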

3.1 Stuttering Simulation 

A relation $B \subseteq S \times S$, where $\mathcal{M} = (S, \to, L)$, is a stuttering simulation if
for every $s$, $w$ such that $sBw$, $s$ and $w$ are identically labeled and every fullpath
starting at $s$ can be matched by some fullpath starting at $w$.

Definition 2. (Stuttering Simulation (STS)) $B \subseteq S \times S$ is a stuttering
simulation on TS $\mathcal{M} = (S, \to, L)$ iff for all $s$, $w$ such that $sBw$:

(Sts1) $L.s = L.w$

(Sts2) $(\forall \sigma : \mathit{fp}.\sigma.s : (\exists \delta : \mathit{fp}.\delta.w : \mathit{match}(B, \sigma, \delta)))$

Lemma 2. $(B \subseteq C) \Rightarrow [\mathit{match}(B, \sigma, \delta) \Rightarrow \mathit{match}(C, \sigma, \delta)]$

Lemma 3. Let $\mathcal{C}$ be a set of STS’s on TS $\mathcal{M}$, then $G = (\cup B : B \in \mathcal{C} : B)$ is an
STS on $\mathcal{M}$.

Corollary 1 For every TS $\mathcal{M}$, there is a greatest STS on $\mathcal{M}$.

Lemma 4. If $R$ and $S$ are STS’s, so is $T = R;S$.

Lemma 5. The reflexive, transitive closure of an STS is an STS.

Theorem 1. Given TS $\mathcal{M}$, there is a greatest STS on $\mathcal{M}$, which is a preorder.

Theorem 2. Let $B$ be an STS on $\mathcal{M}$ and let $sBw$. For every ACTL* \ X formula
$f$, if $\mathcal{M}, w \models f$ then $\mathcal{M}, s \models f$.

3.2 Well-Founded Simulation 

In order to check that a relation is an STS, we have to show that infinite se- 
quences “match”. This can be problematic when using computer-aided verifi- 
cation techniques. I present the notion of a well-founded simulation to remedy 
this situation. To show that a relation is a well-founded simulation, we need 
only check local properties; this is analogous to proving program termination by
exhibiting a function that maps states into a well-founded relation and showing
that the function decreases during every step of the program. As mentioned
previously, the intuition is that for every pair of states $s$, $w$ that are related by an
STS and $u$ such that $s \to u$, there are essentially three cases: either there is a $v$
such that $w \to v$ and $u$ is related to $v$, or $u$ is related to $w$, or there is a $v$ such
that $w \to v$ and $s$ is related to $v$. In the last two cases, we must also ensure
that we do not have an infinite sequence of states, each of which is related to
a single state. This is where the well-founded relation comes in: we must show
that in these cases there is an appropriate measure function into a well-founded
relation that decreases. Formally, we have:

Definition 3. (Well-Founded Simulation (WFS)) $B \subseteq S \times S$ is a well-founded
simulation on TS $\mathcal{M} = (S, \to, L)$ iff:

(Wfs1) $(\forall s, w \in S : sBw : L.s = L.w)$

(Wfs2) There exist functions, $\mathit{rankt} : S \times S \to W$, $\mathit{rankl} : S \times S \times S \to \mathbb{N}$,
such that $(W, \prec)$ is well-founded, and
$(\forall s, u, w \in S : sBw \wedge s \to u :$

(a) $(\exists v : w \to v : uBv) \vee$

(b) $(uBw \wedge \mathit{rankt}(u, w) \prec \mathit{rankt}(s, w)) \vee$

(c) $(\exists v : w \to v : sBv \wedge \mathit{rankl}(v, s, u) < \mathit{rankl}(w, s, u)))$
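Because Wfs1 and Wfs2 are local conditions, they can be checked by enumeration on a finite transition system once candidate rank functions are supplied. The sketch below assumes that both rank functions return natural numbers (so the well-founded structure is simply the naturals with <); the encoding, names, and example system are illustrative assumptions.

```python
def is_wfs(succ, label, B, rankt, rankl):
    """Check Wfs1 and Wfs2 for relation B (a set of pairs (s, w)).

    succ maps a state to its set of successors; label maps a state to its label.
    rankt(s, w) and rankl(w, s, u) must return natural numbers in this sketch.
    """
    # Wfs1: related states are identically labeled.
    if any(label[s] != label[w] for (s, w) in B):
        return False
    # Wfs2: every step s -> u is covered by case (a), (b), or (c).
    for (s, w) in B:
        for u in succ[s]:
            case_a = any((u, v) in B for v in succ[w])
            case_b = (u, w) in B and rankt(u, w) < rankt(s, w)
            case_c = any((s, v) in B and rankl(v, s, u) < rankl(w, s, u)
                         for v in succ[w])
            if not (case_a or case_b or case_c):
                return False
    return True


# Hypothetical example: the "i" states need two steps (i0 -> i1 -> i2) to
# match the single step p0 -> p1, so i0 -> i1 is a stuttering step.
SUCC = {"i0": {"i1"}, "i1": {"i2"}, "i2": {"i0"}, "p0": {"p1"}, "p1": {"p0"}}
LABEL = {"i0": 0, "i1": 0, "i2": 1, "p0": 0, "p1": 1}
B = {("i0", "p0"), ("i1", "p0"), ("i2", "p1")}
RANKT = {("i0", "p0"): 1, ("i1", "p0"): 0}


def rankt(s, w):
    return RANKT.get((s, w), 0)


def rankl(w, s, u):
    return 0
```

The stuttering step from i0 to i1 is justified by case (b), since `rankt` decreases from 1 to 0. If `rankt` is replaced by a constant function, no case applies to that step and the check fails.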



3.3 Equivalence 

In this section, I show that well-founded simulation completely characterizes 
stuttering simulation. Thus, we can think of well-founded simulation as a sound 
and complete proof rule. 

Proposition 1 (Soundness) If $B$ is a WFS, then it is an STS.

Proof Let $aBb$; we need to show Sts1 and Sts2. $L.a = L.b$ since $B$ is a WFS
(Wfs1), thus Sts1 holds. We show $(\forall \sigma : \mathit{fp}.\sigma.a : (\exists \delta : \mathit{fp}.\delta.b : \mathit{match}(B, \sigma, \delta)))$,
namely that Sts2 holds. Suppose $\mathit{fp}.\sigma.a$. We define fullpath $\delta$ and increasing
sequences $\pi$, $\xi$ recursively as follows: $\delta.0 = b$, $\pi.0 = 0$, $\xi.0 = 0$. The idea is that
from $\pi.i$, $\xi.i$, $\delta(\xi.i)$ we can define $\pi(i+1)$, $\xi(i+1)$, $\delta(\xi(i+1))$ with
${}^{\pi}\sigma^{i}$, ${}^{\xi}\delta^{i}$ matching. □

We now prove that every STS is a WFS. For the proof, we have to exhibit 
the rank functions as per the definition of WFS. Here is a high-level overview. 

The value of $\mathit{rankt}(s, w)$ is important only if $sBw$, as otherwise there are no
restrictions required by the definition of WFS. If $sBw$, then consider the largest
subtree of the computation tree rooted at $s$ such that no node in the subtree
matches a successor of $w$. The “rank” (a kind of height) of this subtree is the
value of $\mathit{rankt}(s, w)$. The “rank” of $s$ is greater than the “rank” of any of its
children in the tree, so case Wfs2b is satisfied.

The value of $\mathit{rankl}(w, s, u)$ is important only if $sBw$ and $s \to u$, as otherwise
there are no restrictions required by the definition of WFS. If $sBw$ and $s \to u$,
then $\mathit{rankl}(w, s, u)$ is the length of the shortest path from $w$ that matches $s, u$. In
the case of Wfs2c, we can choose the next successor of $w$ in this path to satisfy
the condition.

Given a TS $\mathcal{M} = (S, \to, L)$, the notion of the computation tree rooted at a
state $s \in S$ is standard. It is the tree obtained by unfolding $\mathcal{M}$ starting from $s$
and can be defined as follows. The nodes of the tree are finite sequences over $S$.
The tree is defined to be the smallest tree satisfying the following.

1. The root is $(s)$.

2. If $(s, \ldots, w)$ is a node and $w \to v$, then $(s, \ldots, w, v)$ is a node whose parent
is $(s, \ldots, w)$.

Definition 4. (tree) Given an STS $B$, if $\neg(sBw)$, then $\mathit{tree}(s, w)$ is the empty
tree, otherwise $\mathit{tree}(s, w)$ is the largest subtree of the computation tree rooted
at $s$ such that for every non-root node of the tree, $(s, \ldots, x)$, we have that $xBw$
and $(\forall v : w \to v : \neg(xBv))$.

Lemma 6. Every path of $\mathit{tree}(s, w)$ is finite.

Since the child relation on nodes in trees is well-founded, we can recursively
define a labeling function, $l$, that assigns an ordinal to nodes in the tree as
follows: $l.n = (\cup c : c \text{ is a child of } n : (l.c) + 1)$. This is the standard “rank”
function encountered in set theory [13]. We use the convention that the label of
a tree is the label of its root.
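For a finite, finitely branching tree the ordinals involved are natural numbers and the union becomes a maximum, so the labeling function reduces to the height of a node. The sketch below makes that simplifying assumption; the dictionary encoding of the tree and the function name are illustrative only.

```python
def rank(children, n):
    """l.n = sup over children c of n of (l.c + 1); a leaf gets 0 (empty sup).

    children maps a node to the list of its children. Only finite,
    finitely branching trees are handled, so every rank is a natural number.
    """
    return max((rank(children, c) + 1 for c in children.get(n, [])), default=0)


# Hypothetical tree: root has children a and b; a has a single child c.
TREE = {"root": ["a", "b"], "a": ["c"]}
```

In this finite setting the rank of a child is always strictly smaller than the rank of its parent, which is the property the completeness proof relies on. Infinitely branching trees need genuine ordinals and are outside the scope of this sketch.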

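Lemma 7. If $|S| \preceq \kappa$, where $\kappa$ is an infinite cardinal (i.e., $\omega \preceq \kappa$) then for all
$s, w \in S$, $\mathit{tree}(s, w)$ is labeled with an ordinal of cardinality $\preceq \kappa$.

Lemma 8. If $sBw$, $s \to u$, $u \in \mathit{tree}(s, w)$ then $l.\mathit{tree}(u, w) \prec l.\mathit{tree}(s, w)$.

Definition 5. (length) Given $B$, an STS, $\mathit{length}(w, s, u) = 0$ if $\neg(sBw)$ or
$\neg(s \to u)$, otherwise $\mathit{length}(w, s, u)$ is the length of the shortest initial
segment starting at $w$ that matches $(s, u)$. Formally:

$\mathit{length}(w, s, u) = (\min \sigma, \delta, \pi, \xi : \mathit{fp}.\sigma.s \wedge \sigma.1 = u \wedge \mathit{fp}.\delta.w \wedge \pi, \xi \in \mathrm{INC} \wedge \mathit{corr}(B, \sigma, \pi, \delta, \xi) : |{}^{\xi}\delta^{0}|)$

As $sBw$ and $s \to u$, the above range is non-empty and $\mathit{length}(w, s, u) \in \mathbb{N}$.

Lemma 9. If $sBw$, $s \to u$ and $\neg(\exists \sigma, \delta, \pi, \xi : \mathit{fp}.\sigma.s \wedge \sigma.1 = u \wedge \mathit{fp}.\delta.w \wedge \pi, \xi \in \mathrm{INC} : \mathit{corr}(B, \sigma, \pi, \delta, \xi) \wedge {}^{\xi}\delta^{0} = (w))$,
then $(\exists v : w \to v : \mathit{length}(v, s, u) < \mathit{length}(w, s, u) \wedge sBv)$.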

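Proposition 2 (Completeness) If $B$ is an STS, then $B$ is a WFS.

Proof Wfs1 follows from Sts1. Let $W = (|S| + \omega)^{+}$. Note that $+$ denotes cardinal
arithmetic; we add $\omega$ to $|S|$ to guarantee that we have an infinite cardinal;
$\kappa^{+}$ is the successor cardinal to $\kappa$.

Clearly, $(W, \prec)$ is well-founded. Let $\mathit{rankt} = l.\mathit{tree}$ and let $\mathit{rankl} = \mathit{length}$.
Let $sBw$ and $s \to u$. There are three cases:

1. $(\exists v : w \to v : uBv)$. By lemma 1, if (1) does not hold, then for every
$\sigma, \delta, \pi, \xi$ such that $\mathit{fp}.\sigma.s \wedge \sigma.1 = u \wedge \mathit{fp}.\delta.w \wedge \pi, \xi \in \mathrm{INC} \wedge \mathit{corr}(B, \sigma, \pi, \delta, \xi)$,
either $s$ marks the end of ${}^{\pi}\sigma^{0}$ or $w$ marks the end of ${}^{\xi}\delta^{0}$, but not both.

2. $(\exists \sigma, \delta, \pi, \xi : \mathit{fp}.\sigma.s \wedge \sigma.1 = u \wedge \mathit{fp}.\delta.w \wedge \pi, \xi \in \mathrm{INC} \wedge {}^{\xi}\delta^{0} = (w) : \mathit{corr}(B, \sigma, \pi, \delta, \xi))$
and (1) does not hold. This implies that $|{}^{\pi}\sigma^{0}| > 1$, $uBw$, and $u \in \mathit{tree}(s, w)$;
hence, $\mathit{rankt}(u, w) \prec \mathit{rankt}(s, w)$ by lemma 8.

3. If (1) and (2) do not hold, we must have $\neg(\exists \sigma, \delta, \pi, \xi : \mathit{fp}.\sigma.s \wedge \sigma.1 = u \wedge \mathit{fp}.\delta.w \wedge \pi, \xi \in \mathrm{INC} : \mathit{corr}(B, \sigma, \pi, \delta, \xi) \wedge {}^{\xi}\delta^{0} = (w))$. By lemma 9 and the
definition of $\mathit{rankl}$, $(\exists v : w \to v : \mathit{rankl}(v, s, u) < \mathit{rankl}(w, s, u) \wedge sBv)$. □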

Theorem 3. (Equivalence) B is an STS iff B is a WFS. 

A consequence of the above theorem is that all of the properties proved for 
STSs carry over to WFSs; I use this fact freely, without reference, in the sequel. 

3.4 Refinement 

Up to this point, I have developed a theory for relating states. I now show how
to apply the theory to transition systems. In this section, I define a notion of
refinement and show that STSs can be used in a compositional fashion. For
states $s$ and $w$, I write $s \sqsubseteq w$ to mean that there is an STS $B$ such that $sBw$.

By theorem 1, $s \sqsubseteq w$ iff $sGw$, where $G$ is the greatest STS. I now lift this
idea to transition systems.

Definition 6. (Simulation Refinement) Let $\mathcal{M} = (S, \to, L)$, $\mathcal{M}' = (S', \to', L')$,
and $r : S \to S'$. We say that $\mathcal{M}$ is a simulation refinement of $\mathcal{M}'$ with
respect to refinement map $r$, written $\mathcal{M} \sqsubseteq_r \mathcal{M}'$, if there exists a relation, $B$,
such that $(\forall s \in S :: sB(r.s))$ and $B$ is an STS on the TS $(S \uplus S', \to \uplus \to', \mathcal{L})$,
where $\mathcal{L}.s = L'(s)$ for $s$ an $S'$ state and $\mathcal{L}.s = L'(r.s)$ otherwise.

In the above definition, it helps to think of $\mathcal{M}'$ as the specification and $\mathcal{M}$ as
the implementation. That $\mathcal{M}$ is a simulation refinement of $\mathcal{M}'$ with respect to
$r$ implies that every visible behavior of $\mathcal{M}$ (where what is visible depends on $r$)
is a behavior of $\mathcal{M}'$. There are often other considerations, e.g., it might be that
$\mathcal{M}$ and $\mathcal{M}'$ have certain states that are “initial”. In this case one might wish to
show that initial states in $\mathcal{M}$ are mapped to initial states in $\mathcal{M}'$.

One has a great deal of flexibility in choosing refinement maps. The danger is
that by choosing a complicated refinement map, one can bypass the verification
problem altogether. To make this point clear, let PRIME be the system whose
single behavior is the sequence of primes and let NAT be the system whose single
behavior is the sequence of natural numbers. We do not consider NAT to be an
implementation of PRIME, but using the refinement map from NAT to PRIME
that maps $i$ to the $i$th prime, we can indeed prove the peculiar theorem that
NAT is a refinement of PRIME. The moral is that we must be careful to not
bypass the verification problem with the use of such refinement maps. Simple
refinement maps with a clear relationship between implementation states and
their image under the map are best. The reason we do not place restrictions
on refinement maps is that it is not a priori apparent what the “reasonable”
relationships between implementation states and specification states might be,
e.g., suppose that the specification system represents numbers in decimal but
the implementation system represents numbers in binary, or that numbers in
the specification are spread across several registers in the implementation, and
so on. Often refinement maps are especially clear, which makes it easy to check
that they are in fact appropriate. Suppose that associated with states is a set of
variables, each of a particular type. Furthermore, suppose that the variables in
the implementation are a superset of the variables in the specification and that
the refinement map just hides the implementation variables that do not appear
in the specification. Then, it is clear that the refinement map is a reasonable
one. More precisely, given TS $\mathcal{M} = (S, \to, L)$, if $L$ has the following structure,
we say that $\mathcal{M}$ is typed.

Let VARS be a set and let TYPE be a function whose domain is VARS.
Think of VARS as the variables of TS $\mathcal{M}$, where TYPE gives the type of the
variables. For all $s \in S$, let $L.s$ be a function from VARS such that
$L.s.v \in \mathrm{TYPE}.v$. The lemma below shows why the appropriateness of refinement maps
that hide some of the implementation variables is easy to ascertain.

Lemma 10. If $\mathcal{M} = (S, \to, L)$, $\mathcal{M}' = (S', \to', L')$, both $\mathcal{M}$ and $\mathcal{M}'$ are
typed TSs, and $L'(r.s) = L.s|_V$, then for every pair of states $s$, $r.s$ such that
$s \in S$, and every ACTL* \ X formula, $f$, built out of expressions that only
depend on variables in $V$, we have $\mathcal{M}', r.s \models f \Rightarrow \mathcal{M}, s \models f$.
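A refinement map that only hides variables is simple to express when the label of a state is a mapping from variable names to values. The sketch below is illustrative: the variable names, the types, and the helper names are all assumptions, and states are identified with their labels for brevity.

```python
def hide(state, visible):
    """Refinement map that keeps only the variables in `visible`."""
    return {v: state[v] for v in visible}


def well_typed(state, type_of):
    """Check that every variable holds a value of its declared type."""
    return all(state[v] in type_of[v] for v in type_of)


# Hypothetical typed implementation state with an extra internal latch.
IMPL_TYPE = {"pc": range(16), "acc": range(256), "latch": range(256)}
SPEC_TYPE = {"pc": range(16), "acc": range(256)}
V = ("pc", "acc")
IMPL_STATE = {"pc": 4, "acc": 17, "latch": 200}
SPEC_STATE = hide(IMPL_STATE, V)
```

Because the map is a projection, an expression over the variables in `V` evaluates identically in `IMPL_STATE` and `SPEC_STATE`, which is why such maps are easy to judge as appropriate.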

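Lemma 11. If $B$ is an STS on TS $\mathcal{M} = (S \supseteq S_1 \cup S_2, \to, L)$, $S_1 \cap S_2 = \emptyset$,
states in $S_1$ can only reach states in $S_1$, and states in $S_2$ can only reach states
in $S_2$, then $\{(s_1, s_2) : s_1 \in S_1 \wedge s_2 \in S_2 \wedge s_1 B s_2\}$ is an STS on $\mathcal{M}$.

Theorem 4. (Composition) If $\mathcal{M} \sqsubseteq_r \mathcal{M}'$ and $\mathcal{M}' \sqsubseteq_q \mathcal{M}''$ then $\mathcal{M} \sqsubseteq_{r;q} \mathcal{M}''$.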

4 The Linear Time Case 

The theorem on the existence of refinement maps in the previous section does
not apply to the linear time framework because simulation is a stronger property
than trace containment. However, note that if we destroy the branching structure
of transition system $\mathcal{M}$ to obtain transition system $\mathcal{M}'$, then $\mathcal{M}' \sqsubseteq_r \mathcal{N}$ iff the
set of infinite sequences of $\mathcal{M}$, labeled by $r$, is a subset of the set of sequences
of $\mathcal{N}$. We can destroy the branching structure of $\mathcal{M}$ by using an oracle variable
to record values for every non-deterministic choice made along an infinite path
in the computation tree of $\mathcal{M}$. We have thus sketched a proof of the existence
of refinement maps in the linear time framework.

Theorem 5. If the set of traces of $\mathcal{M}$ is a subset of the traces of $\mathcal{N}$, then there
exists $\mathcal{M}'$, a transition system obtained from $\mathcal{M}$ by adding an oracle variable,
and a refinement map $r$ such that $\mathcal{M}' \sqsubseteq_r \mathcal{N}$.

I now review the work of Abadi and Lamport on the existence of refinement
maps. The review addresses the essential points, but is necessarily concise and
readers are urged to read the full paper. I then present several examples, taken
from Abadi and Lamport, that are used to justify the conditions appearing in
their theorem. At the end of this section, I compare the two approaches.



4.1 Review of Abadi and Lamport Results 

I begin by reviewing some initial definitions. A behavior is an infinite sequence 
and a property is a set of behaviors closed under finite stuttering. A specification 
is a (possibly infinite) state machine, consisting of externally visible components 
and internal components, and a supplementary property to represent fairness 
constraints. The complete property of a state machine is obtained by closing 
the set of behaviors allowed by the machine under (possibly infinite) stuttering. 
The externally visible property of a state machine is obtained by projecting the 
externally visible components of the complete property of the state machine. 
The property defined by a specification is obtained by intersecting the complete 
property of its state machine with the supplementary property. The externally 
visible property of a specification is obtained by projecting the externally visible 
components of the property of the specification. 

We say that I, a “concrete” specification (the Implementation), implements
S, an “abstract” specification (the Specification) if every externally visible
behavior of I is also a behavior of S. Proving that I implements S can require
reasoning about arbitrary sequences because one has to show that if I admits
the behavior $((e_0, z_0), (e_1, z_1), \ldots, (e_n, z_n), \ldots)$, where the $e_i$ correspond to the
externally visible components and the $z_i$ to the internal components, then S
admits the behavior $((e_0, y_0), (e_1, y_1), \ldots, (e_n, y_n), \ldots)$. Notice that $y_i$ can
depend upon the entire sequence $((e_0, z_0), (e_1, z_1), (e_2, z_2), \ldots)$, which can make
the proof difficult. We prefer to avoid such global reasoning and would rather
reason locally, e.g., if there is a function $f$ such that $(e_i, y_i) = f(e_i, z_i)$, it can be
used to prove that I preserves the safety property of S by reasoning about pairs
of states instead of arbitrary sequences of states. If such a function also
preserves liveness, it is called a refinement mapping and Abadi and Lamport prove
the following completeness theorem, showing under what conditions refinement
mappings exist.

Theorem 6. If the machine-closed specification I implements S, a specification
that has finite invisible nondeterminism and is internally continuous, then there
is a specification $I^{h}$ obtained from I by adding a history variable and a
specification $I^{hp}$ obtained from $I^{h}$ by adding a prophecy variable such that there exists
a refinement mapping from $I^{hp}$ to S.

The above theorem depends on various conditions, which I now explain. We 
say that a specification I is machine-closed if the supplementary property of I 
does not specify any safety property not already specified by the state machine 
of I. A specification S has finite invisible nondeterminism if for every finite
prefix of every behavior allowed by S, whenever an infinite nondeterministic
choice is made, all but a finite part of that choice is immediately revealed in 
the externally visible components of the resulting state. A specification S is 
internally continuous if for every behavior, if the behavior is not allowed by the 
specification, this can be determined by looking at the externally visible part of 
the behavior and some finite portion of the complete behavior. 

A history variable is used to extend the state space of a specification with 
a component that records past information, but in a way that does not affect 
the externally visible behaviors of the specification. Abadi and Lamport give 
five conditions that must be satisfied in order to show that if a specification 
is obtained from a specification S by adding a history variable, then the two 
specifications define the same externally visible property. A prophecy variable is 
the dual of a history variable. Instead of recording past information, it guesses 
future information. Abadi and Lamport give six conditions that must be satisfied 
in order to show that if a specification is obtained from a specification S by 
adding a prophecy variable, then the two specifications define the same externally 
visible property. 



4.2 Examples Due to Abadi and Lamport 

This section contains several examples that Abadi and Lamport use to explain
the conditions found in their completeness theorem. After the examples are
introduced, I show how they can be handled in my framework.

In the first example, system $\mathcal{S}$ is a three-bit clock, where only the low-order
bit is externally visible and system $\mathcal{I}$ is a one-bit clock. $\mathcal{I}$ implements $\mathcal{S}$ since
they have the same traces (up to stuttering). However, no refinement mapping
can be used to show this because there is no way to define the internal state of $\mathcal{S}$:
consider an arbitrary refinement mapping, $r$, and suppose that $r((0)) = (0, y_0)$
and $r((1)) = (1, y_1)$, then either $(0, y_0)$ does not transit to $(1, y_1)$ or $(1, y_1)$ does
not transit to $(0, y_0)$. This is one reason for introducing history variables and
they are used to resolve the dilemma as follows. A history variable is added to $\mathcal{I}$
and the variable “remembers” what $\mathcal{I}$ did in the past. The result is that the state
space of $\mathcal{I}$ is expanded so that there are enough states to define an appropriate
refinement mapping.

Using the approach outlined in this paper, we find that history variables are
not needed as we can define a refinement map that maps the state in $\mathcal{I}$ whose
counter is 0 to any state in $\mathcal{S}$ whose low-order bit is 0 and similarly with the other
state in $\mathcal{I}$. The equivalence relation that relates states with the same low-order
bit in the disjoint union of the two systems is a stuttering simulation.
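The clock example can be checked mechanically on the disjoint union of the two systems. The sketch below is illustrative only: the tagged-tuple state encoding and the helper names are assumptions. It verifies that the relation relating states with the same low-order bit satisfies case (a) of the well-founded simulation definition at every step, which suffices here because each step of either clock flips the low-order bit, so no stuttering is needed.

```python
# States of the disjoint union: ("I", b) for the one-bit clock,
# ("S", n) for the three-bit clock.
I_STATES = [("I", b) for b in range(2)]
S_STATES = [("S", n) for n in range(8)]
STATES = I_STATES + S_STATES


def succ(state):
    """Each clock increments its counter modulo its own size."""
    tag, n = state
    return {(tag, (n + 1) % (2 if tag == "I" else 8))}


def label(state):
    """Only the low-order bit is observable."""
    return state[1] % 2


def r(state):
    """One possible refinement map: counter b in I goes to counter b in S."""
    return ("S", state[1])


# The equivalence relating states with the same low-order bit.
B = {(s, w) for s in STATES for w in STATES if label(s) == label(w)}


def satisfies_case_a(rel):
    """Every step s -> u with (s, w) in rel is answered by some w -> v with (u, v) in rel."""
    return all(any((u, v) in rel for v in succ(w))
               for (s, w) in rel for u in succ(s))
```

Since related states are identically labeled by construction and case (a) always applies, `B` is a well-founded simulation with trivial rank functions, and hence a stuttering simulation by the equivalence theorem. The map `r` is just one choice; any map sending an implementation state to a specification state with the same low-order bit would pass the same check.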

The second example is used to motivate the need for prophecy variables.
System $\mathcal{S}$ chooses ten values non-deterministically and displays each in turn,
whereas system $\mathcal{I}$ chooses each value as it is displayed. $\mathcal{I}$ implements $\mathcal{S}$ since
they have the same traces, but there is no refinement mapping that can be used
to show this, as should be clear. This example highlights that proofs based on
refinement mappings are based on simulation, a branching time notion. Thus,
when $\mathcal{I}$ is not a stuttering simulation of $\mathcal{S}$, one cannot directly use refinement
mappings to prove that $\mathcal{I}$ implements $\mathcal{S}$ (in the linear time sense). This is one
reason for introducing prophecy variables and they are used to resolve the dilemma
as follows. A prophecy variable is added to $\mathcal{I}$ and the variable “guesses” what
$\mathcal{I}$ will decide to do in the future. There is now a refinement map, based on
this prophecy variable, that can be used to show that $\mathcal{I}$ implements $\mathcal{S}$. What is
happening is that the prophecy variables allow one to push all of the branching
in the computation tree of $\mathcal{I}$ up to the root, thereby destroying the branching
structure of $\mathcal{I}$.

This example shows why oracle variables are used in theorem 5. Note that
from the branching point of view $\mathcal{I}$ does not implement $\mathcal{S}$, e.g., from the
initial state in $\mathcal{I}$, there is a successor that has more than one possible future, a
branching-time expressible property that does not hold in the initial state of $\mathcal{S}$.
It seems that any refinement-based approach will need a mechanism for dealing
with this issue, whether it is by destroying the branching structure of
implementations, by adding branching structure to specifications, or by some combination
thereof.

The third example shows why a prophecy variable is needed to slow down
an implementation that runs faster than a specification, even though the
specification is just stuttering. Both $\mathcal{I}$ and $\mathcal{S}$ specify clocks in which the hours and
minutes are externally visible, whereas the seconds are internal. Furthermore,
$\mathcal{I}$ increments the clock by one second, whereas $\mathcal{S}$ increments the clock by ten
seconds. Both $\mathcal{I}$ and $\mathcal{S}$ have the same externally visible behaviors and proving
that $\mathcal{S}$ implements $\mathcal{I}$ using refinement mappings is easy. However, there is no
way to show that $\mathcal{I}$ implements $\mathcal{S}$, because there is a behavior of $\mathcal{S}$ such that
the minute hand changes every six steps, but any behavior of $\mathcal{I}$ requires at least
sixty steps between minute hand changes.

In my formulation, the implementation is allowed to run faster than the 
specification, as we can both add and remove stuttering steps, thus it is easy to 
deal with the third example. 

Abadi and Lamport present examples showing why the conditions of finite
invisible nondeterminism and internal continuity are required. The examples are
similar in that the implementation, I, has the same externally visible
behaviors as the specification, S, but I has a richer branching structure than
S, i.e., S is a simulation refinement of I, but not the other way. As we have
seen in the second example, above, prophecy variables can be used to deal with
this problem. However, in these examples there are states in I that are related
to an infinite number of states in S, and Abadi and Lamport’s prophecy
variables cannot be used in this case (see their paper for the full details).
To summarize, the conditions of internal continuity and finite invisible
nondeterminism in the completeness theorem of Abadi and Lamport can be traced
to the branching structure of the systems involved.

Oracle variables can be used in my approach to deal with these examples.
The intuition is that oracle variables allow us to quantify over every possible
nondeterministic choice and can be used to transform I into a linear time
equivalent system in which all nondeterministic choices have been made at the
onset.







4.3 Comparison with the Approach of Abadi and Lamport 

There are various differences between my approach and that of Abadi and Lam- 
port. A major difference is that I deal with branching time notions because in the 
context of mechanical verification they provide certain advantages, as outlined 
above. However, in order to simplify the comparison, in this section I consider 
only the linear time aspects of my results. 

There are differences in how stuttering is dealt with; namely, Abadi and Lam- 
port allow infinite stuttering, whereas I do not. Consider the example of pipelined 
machine verification. Using the Abadi and Lamport approach, we would define 
the instruction set architecture using a state machine, say where every com- 
ponent is externally visible. By definition, the property generated by the state 
machine includes infinite stuttering, e.g., it includes the behavior where noth- 
ing happens. Thus, a supplementary property would be used to rule out such 
behaviors by requiring that non-stuttering steps are eventually taken, a live- 
ness property. In contrast, in my approach, every step of the transition system 
modeling the instruction set architecture corresponds to the execution of an 
instruction, with the stuttering being handled by the definition of stuttering 
simulation. Notice that no supplementary property is required. In addition, the 
condition that a pipelined machine makes progress is now a safety property, be- 
cause the number of steps required is bounded by the number of stages in the 
pipeline [16]. 

Lamport and Abadi require that systems have the same externally visible 
states. They make the point that one cannot say whether the value 11111100 
corresponds to —3 without knowing how to interpret a sequence of bits as an 
integer. They go on to say that given such an interpretation, they can trans- 
late the externally visible states to the appropriate representation. In my case, 
instead of having a separate interpretation phase, I allow refinement maps to 
alter the labels of states directly. I have found that in practice this extra power 
is necessary. For example, when proving that a pipelined machine implements 
the instruction set architecture, I have used refinement maps that either mod- 
ify the value of the program counter (when using my “commit” approach to 
correctness) or modify the register file and memory (using the Burch and Dill 
“flushing” approach to correctness) [16]. The point is that when using my com- 
mit approach to correctness, if we consider the program counter to be externally 
visible then we cannot use the Abadi and Lamport approach to prove that a 
pipelined machine implements the instruction set architecture. Similarly, when 
using the Burch and Dill approach, if we consider the register file or memory to 
be externally visible, then we cannot use the Abadi and Lamport approach to 
prove that a pipelined machine implements the instruction set architecture. 

The refinement mappings of Abadi and Lamport are required to preserve 
the supplementary property of the specification. As they point out, this is not a 
local condition, but one can apply local methods such as well-founded induction 
for the proof. Unfortunately, they do not provide any guidance on constructing 
such arguments. In my case, the proof of proposition 2 (if B is an STS, then 
B is a WFS) shows how to construct the appropriate well-founded relations 
and measure functions, rankt and rankl. The proof also shows that two measure
functions, one from pairs of states and one from triples of states to the naturals, 
are enough regardless of the transition systems involved. 

Finally, my theorems are stronger than the ones given by Abadi and Lam- 
port. For example, they show that even when S is not internally continuous a
refinement map exists to show that I satisfies the safety property specified by 
S. They continue “We do not know if anything can be said about proving arbi- 
trary liveness properties.” Since my refinement theorems apply to any systems, a 
simple corollary is that, with my approach, refinement maps can always be used 
to prove both safety and liveness properties. This is something that we used 
in [17] where we used theorem proving to reduce an infinite-state system to a 
finite-state system in such a way that stuttering-insensitive properties, including 
liveness, were preserved. We then model checked the reduced system and were 
able to lift the results to the original system. 

5 Conclusions 

I have introduced compositional notions of refinement for stuttering simulation. 
I have shown that if one system refines another in the branching time framework, 
then a refinement map always exists, without relying on any of the conditions 
present in the approach taken by Abadi and Lamport, e.g., machine closure, finite 
invisible nondeterminism, internal continuity, the use of history and prophecy
variables, etc. I also showed that refinement maps always exist in the linear time 
framework, subject only to the use of oracle variables. 

My main motivation is the mechanical verification of systems. Notions of 
refinement based on stuttering bisimulation have proved useful for this pur- 
pose [17, 16]. However, stuttering bisimulation is applicable only in limited con- 
texts, as usually specifications contain more nondeterminism than implementa- 
tions. Thus, I expect that stuttering simulation will turn out to be more useful 
than stuttering bisimulation. 
Abstract. As of version 2.7, the ACL2 theorem prover has been ex-
tended to automatically verify sets of polynomial inequalities that in- 
clude nonlinear relationships. In this paper we describe our mechaniza- 
tion of linear and nonlinear arithmetic in ACL2. The nonlinear arith- 
metic procedure operates in cooperation with the pre-existing ACL2 lin- 
ear arithmetic decision procedure. It extends what can be automatically 
verified with ACL2, thereby eliminating the need for certain types of 
rules in ACL2’s database while simultaneously increasing the perfor- 
mance of the ACL2 system when verifying arithmetic conjectures. The 
resulting system lessens the human effort required to construct a large 
arithmetic proof by reducing the number of intermediate lemmas that 
must be proven to verify a desired theorem. 



1 Introduction 

Mechanical theorem proving or proof checking systems offer a rigorous methodol- 
ogy with which to structure and check proofs. Each such system offers a different 
degree of automation - directly affecting its capability and ease of use. We have 
extended the ACL2 theorem proving system [7,8,9] with an automated verifi- 
cation procedure that enhances the linear arithmetic decision procedure. ACL2 
can now more easily verify sets of inequalities containing nonlinear arithmetic 
relationships. 

In this paper we describe our mechanization of linear and nonlinear arith- 
metic in ACL2. Before doing so, we briefly describe the theory behind the pro- 
cedures and provide a couple of trivial examples of their use. The procedures 
operate on inequalities over the rationals. These inequalities can be combined 
by cross-multiplication and addition to permit the deduction of an additional 
inequality. For example, if $0 < \mathit{poly}_1$ and $0 < \mathit{poly}_2$,
and $c$ and $d$ are positive rational constants, then
$0 < c \cdot \mathit{poly}_1 + d \cdot \mathit{poly}_2$. Here, we are using two
facts: multiplication by a positive rational constant does not change the sign
of a polynomial, and the sum of two positive polynomials is itself positive.
This is linear arithmetic. We also have that
$0 < c \cdot \mathit{poly}_1 \cdot \mathit{poly}_2$. In this nonlinear case, we
are using the fact that the product of two positive polynomials is itself
positive.

Now suppose we want to prove

$$3 \cdot x + 7 \cdot a < 4 \;\wedge\; 3 < 2 \cdot x \;\Longrightarrow\; a < 0.$$

To do this, we assume the two hypotheses and the negation of the conclusion,
and look for a contradiction. We therefore start with the three inequalities:

$$0 < -3 \cdot x + -7 \cdot a + 4 \tag{1}$$

$$0 < 2 \cdot x + -3 \tag{2}$$

$$0 \le a. \tag{3}$$

We cross-multiply and add the first two - that is, multiply inequality (1) by
two and inequality (2) by three, and then add the respective sides. This yields

$$0 < -14 \cdot a + -1. \tag{4}$$

Note that the new inequality does not mention x. If we choose two inequalities
with the same leading term and leading coefficients of opposite sign, we can
generate an inequality in which that leading term is not present. This is the
general strategy employed by the linear arithmetic decision procedure.

If we next cross-multiply and add inequality (4) with inequality (3), we get

$$0 < -1, \tag{5}$$

a false polynomial. We have, therefore, proved our theorem.

The process illustrated above, cross-multiplying and adding two inequalities,
will be referred to as “cancelling” the two inequalities. We shall refer to
obviously false inequalities such as (5) as “contradictions,” and speak of any
process that results in one of these as “generating a contradiction.”
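
The cancellation steps of the first example can be sketched in a few lines of Python. This is only an illustrative sketch and not ACL2's implementation: the tuple representation of a poly and the helper names `make_poly`, `cancel`, and `is_contradiction` are assumptions made for this example.

```python
from fractions import Fraction

# Assumed representation: a poly "0 < p" or "0 <= p" is a triple
# (coeffs, const, strict), where coeffs maps an unknown to its coefficient.
def make_poly(coeffs, const, strict):
    return ({u: Fraction(c) for u, c in coeffs.items() if c != 0},
            Fraction(const), strict)

def cancel(p, q, unknown):
    """Cross-multiply and add p and q so that `unknown` disappears."""
    (pc, pk, ps), (qc, qk, qs) = p, q
    a, b = pc[unknown], qc[unknown]
    if a * b >= 0:
        raise ValueError("coefficients must have opposite signs")
    m, n = abs(b), abs(a)  # positive multipliers preserve both inequalities
    coeffs = {}
    for u in set(pc) | set(qc):
        c = m * pc.get(u, 0) + n * qc.get(u, 0)
        if c != 0:
            coeffs[u] = c
    # The sum is strict if either summand is strict.
    return (coeffs, m * pk + n * qk, ps or qs)

def is_contradiction(p):
    coeffs, const, strict = p
    return not coeffs and (const < 0 or (strict and const == 0))

p1 = make_poly({"x": -3, "a": -7}, 4, True)   # (1)  0 <  -3x + -7a + 4
p2 = make_poly({"x": 2}, -3, True)            # (2)  0 <  2x + -3
p3 = make_poly({"a": 1}, 0, False)            # (3)  0 <= a
p4 = cancel(p1, p2, "x")                      # (4)  0 <  -14a + -1
p5 = cancel(p4, p3, "a")                      # (5)  0 <  -1
print(p4, p5, is_contradiction(p5))
```

Exact rational arithmetic (`Fraction`) is used here because the procedures operate over the rationals; floating point would risk spurious nonzero coefficients after cancellation.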

Next, suppose that we have the three assumptions

$$3 \cdot x \cdot y + 7 \cdot a < 4 \quad\text{or}\quad 0 < -3 \cdot x \cdot y + -7 \cdot a + 4 \tag{6}$$

$$3 < 2 \cdot x \quad\text{or}\quad 0 < 2 \cdot x + -3 \tag{7}$$

$$1 < y \quad\text{or}\quad 0 < y + -1, \tag{8}$$

and we wish to prove that $a < 0$. We proceed by assuming the negation of our
goal, $0 \le a$, and looking for a contradiction.

Note that in this case no two inequalities have a leading term in common. In
this situation there are no cancellations to perform. However, (6) has a product
as its leading term, $x \cdot y$, and for each of the factors of that product,
$x$ and $y$, there is an inequality which has such a factor as a leading term.
When nonlinear arithmetic is enabled, ACL2 will multiply (7) and (8), obtaining

$$0 < 2 \cdot x \cdot y + -3 \cdot y + -2 \cdot x + 3. \tag{9}$$

The addition of this polynomial will allow cancellation to continue¹ and, in
this case, we will prove our goal. Thus, just as ACL2 adds two polynomials when
they have the same largest unknown of opposite signs in order to create a new
smaller polynomial, ACL2 can now multiply polynomials when the product of
their largest unknowns is itself the largest unknown of another polynomial.

¹ Inequality (9) can be canceled with (6). The result can be canceled with (8),
and so on. The final cancellation will be with the negation of our goal,
$0 \le a$.
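
The chain of cancellations mentioned in the footnote above can be written out in full. The derivation below is a worked sketch under two assumptions: that the unknowns are ordered $x \cdot y > y > x > a$ (which matches the order in which the footnote cancels), and that each multiplier is chosen just large enough to make the leading terms cancel.

```latex
\begin{align*}
2\cdot(6) + 3\cdot(9):&\quad 0 < -9\cdot y + -6\cdot x + -14\cdot a + 17 \\
\text{adding } 9\cdot(8):&\quad 0 < -6\cdot x + -14\cdot a + 8 \\
\text{adding } 3\cdot(7):&\quad 0 < -14\cdot a + -1 \\
\text{adding } 14\cdot(0 \le a):&\quad 0 < -1
\end{align*}
```

The last line is a contradiction, so $a < 0$ follows from assumptions (6), (7), and (8).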
1.1 Related Work and Plan of the Paper

It is often desirable to verify the correct operation of computer hardware or soft- 
ware. These operations may involve arithmetic, as in the floating-point hardware 
of a modern microprocessor or pointer arithmetic in a C program. 

Several approaches to automating the verification of arithmetic lemmas have 
been tried. Great progress has been made as is illustrated by the many substan- 
tial proofs recently completed in PVS, HOL, and ACL2 [10,5,13]. The existing 
state-of-the-art is, however, not sufficient. The level of user expertise and effort 
required for the above-mentioned work is too high. 

One of the primary difficulties encountered has been the fact that the
formulae to be proved are rarely limited to just the four basic arithmetic
operations, +, -, *, and /, but often involve diverse semantic constructs or,
at the least, user-defined functions. Theorem provers, therefore, cannot limit
themselves to “pure” arithmetic but must work with combinations of theories.

Our approach has developed from an engineering, results-oriented perspec- 
tive, and we have therefore concentrated on decreasing the user’s effort for the 
types of lemmas we see ACL2 users attempting to prove. Others have taken a 
more theoretical approach, whereby they can guarantee algorithmic
completeness² over an exactly specified domain.

Several groups have built such systems by combining small domain-specific
provers. Nelson and Oppen [11] and Shostak [14] describe frameworks with which
one can combine separate existing decision procedures into one larger procedure. 
This work has been extended by others; e.g., Kapur [6] and SRI [12]. While this 
approach has some nice properties, such as completeness and efficiency, it can 
be somewhat limiting. Some of these limitations arise because the procedures to 
be integrated are treated as fixed black-boxes. Armando and Ranise [1] describe 
a method for augmenting the black-boxes. Other limitations arise from concerns 
over efficiency. Harrison [4] explores the use of a full decision procedure over the 
reals and discusses its desirability. 

We build on earlier work by Boyer and Moore [2] and share their design
philosophy. We regard ACL2’s various procedures as a mutually recursive nest of
functions, and have tuned both the interfaces and internals of these procedures
using feedback from users to guide the process. Our work is also similar to
that of Cyrluk and Kapur [3]; they too were concerned with augmenting existing
linear arithmetic decision procedures to handle nonlinear inequalities, and
their design was also driven by engineering rather than theoretical concerns.
We have the advantage of possessing much faster computers than were available
at that time and believe that the time has come to reexamine the feasibility
of more ambitious, but still incomplete, algorithms for handling nonlinear
arithmetic³. In this paper we present our first attempt at fully integrating a
nonlinear arithmetic semi-decision procedure into ACL2. We present merely an
outline of our work, and do not discuss or even mention many of the heuristics
that we have employed to limit and guide our algorithms.

² A decision procedure is said to be complete if it always returns a (correct)
answer when asked to verify a true theorem. An incomplete procedure, by
contrast, may return an “I don’t know” or even not return at all.

We provide the required background including the definitions of polys, pots, 
and labels, as well as a short discussion of type reasoning and linearization. 
Thereafter we describe the subprocedures that make up the linear and nonlinear 
arithmetic procedures. The linear arithmetic procedure consists of two nested 
loops. The innermost of these, the linear arithmetic decision procedure (described 
in Section 2) is responsible for adding inequalities to the pot-list. In Section 3 
we present linear arithmetic’s outer loop, the linear lemmas procedure which 
attempts to gather additional inequalities in order to allow further cancellations. 
The nonlinear arithmetic procedure consists of three nested loops. The innermost 
is the same as for linear arithmetic. The next, described in Section 4, is an 
augmentation of that described in Section 3. Nonlinear arithmetic’s outer loop 
is presented in Section 5. We conclude with a few remarks about the labor that 
can now be saved. 



1.2 Polys, Pots, Pot-lists, and Labels 

The procedures we will describe here operate on polynomial inequalities over the
rationals. A “polynomial” is a sum of terms, each of which is either a rational
constant or the product of a rational constant and an “unknown.” An example
polynomial is $3 \cdot x + -7/2 \cdot a + 2$; here $x$ and $a$ are the
unknowns. The unknowns, however, need not be variable symbols; e.g., $|x|$,
$x^n$, or $f(x, y)$ may be used as unknowns. Thus, $-3 \cdot |x| + a$ is also a
polynomial.

A “polynomial inequality,” or a “poly” for short, is an inequality (either $<$
or $\le$) between 0 and a polynomial; e.g., $0 < 3 \cdot x + -7/2 \cdot a + 2$
and $0 < -3 \cdot |x| + a$ are polys. We refer to obviously false polys such as
$0 < 0$ as “contradictions.”

Polys are stored in groups called “pots.” All the polys with the same largest 
unknown⁴ are stored in a single pot, which is said to be “labeled by” or “about”
that unknown. These pots are further divided into two compartments - one 
for “positive” polys (with a positive leading coefficient) and one for “negative” 
polys (with a negative leading coefficient). A pot represents the conjunction of 
the polys in it. 

The pots are stored in a “pot-list,” which represents the conjunction of the
pots in it. An example⁵ is:

label      positives                     negatives

b          $0 < b + -1 \cdot a + -1$

f(x, y)    $0 < f(x, y)$                 $0 < -1 \cdot f(x, y) + -1 \cdot b + a$

We refer to the $b$ and $f(x, y)$ above as their pots’ “labels.”

³ One simple example of incompleteness is our inability to (automatically)
prove $x \cdot x \ne 2$. If we were operating over the reals, $x = \sqrt{2}$
would be a solution, but recall that we are operating over the rationals. The
authors have not done a study of how to demarcate the class of formulas on
which our algorithms succeed or fail. They are, however, unaware of any
examples which “should” be provable using these techniques but which are not
provable for reasons other than limiting heuristics.

⁴ The order used here is basically lexicographic, considering number of
variables first, number of function symbols second, and alphabetical order last.

⁵ Note that there are cancellations which can be performed.

The procedures that we will describe here all take, among other arguments, 
a pot-list and a list of polys to be added to the pot-list. They return either an 
augmented pot-list or a contradiction (a false poly) - the latter case indicating 
success as in Section 1. 

1.3 Type Reasoning and Polys from Type-Set 

We shall treat type reasoning - carried out by calling the ACL2 function 
type-set - as something of a black-box. For present purposes only a few things 
need be known about it. 

First, we use it here to quickly answer the question “To what arithmetic 
category does this expression belong?” where the possible answers are zero,⁶
positive integer, negative integer, positive ratio, negative ratio, or combinations 
thereof such as non-negative rational. 

Second, we can sometimes form polys about an expression based on the 
answer given. For example, if x is said to be a nonpositive rational, we can 
create the poly $0 \le -1 \cdot x$ from that information. We refer to this mechanism as
“creating polys from type-set.” 

Third, although type-set’s reasoning abilities are fairly limited, they can be 
extended through the use of type-prescription rules. ACL2 comes with some of
these already built in, including rules about the basic arithmetic functions,
such as the rule that $x \cdot y$ is a positive rational if both $x$ and $y$
are. This rule can, via the
above-mentioned mechanism of creating polys from type-set, provide a small 
amount of nonlinear reasoning to the linear arithmetic procedures. We shall see 
this shortly. 

1.4 Linearization 

Linearization is the process of converting an ACL2 term into one or more polys. 
We note the following: 

1. An equality can be expressed as a conjunction of two inequalities; $x = y$
is true if and only if both $x \le y$ and $y \le x$ are true.⁷

2. We normalize polys so that their leading coefficient is $\pm 1$.

3. Consider the ACL2 term⁸ (< x y). If we know that both x and y are integers,
we can assume that we are linearizing (<= (+ x 1) y) instead, and so convert
(< x y) to the poly $0 \le y + -1 \cdot x + -1$ rather than the weaker
$0 < y + -1 \cdot x$. We shall refer to this as “the 1+ trick.” This is the
only place in which the procedures described here take advantage of the
discreteness of the integers.

⁶ A category with only one member.

⁷ Note that the negation of an equality can similarly be expressed as a
disjunction of two inequalities. We do not address this further in the present
paper except to say that ACL2 does handle such situations.

⁸ ACL2 terms are a subset of Lisp expressions, and therefore use a Lisp-style
prefix syntax.
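
The first and third linearization rules can be illustrated with a small Python sketch. It is not ACL2 code: the function names and the `(coeffs, const, strict)` triple used to stand for a poly are assumptions made for this example, and the arguments are taken to be names of distinct unknowns.

```python
def linearize_less_than(x, y, both_integers=False):
    """Linearize the term (< x y) into one poly (coeffs, const, strict)."""
    if both_integers:
        # The 1+ trick: over the integers, x < y means x + 1 <= y,
        # which gives the non-strict poly 0 <= y + -1*x + -1.
        return ({y: 1, x: -1}, -1, False)
    # Otherwise only the weaker strict poly 0 < y + -1*x is available.
    return ({y: 1, x: -1}, 0, True)

def linearize_equal(x, y):
    """Linearize x = y as the conjunction of x <= y and y <= x."""
    return [({y: 1, x: -1}, 0, False), ({x: 1, y: -1}, 0, False)]

print(linearize_less_than("a", "b", both_integers=True))
print(linearize_equal("a", "b"))
```

The integer version is stronger because it rules out every rational value of y strictly between x and x + 1.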

2 Linear Arithmetic 

In this section, we describe the innermost loop of ACL2’s arithmetic procedures. 
The algorithm described in this section is a decision procedure for linear arith- 
metic over the rationals. We later refer to this as the linear arithmetic algorithm. 

As in the examples of Section 1, our goal is to derive a contradiction. In order 
to do so, all of the unknowns of a poly must be eliminated by cancellation. We 
can choose to eliminate them in any order, but we eliminate the largest first. That is,
two polys are canceled against each other only when their largest unknowns are 
the same and have coefficients of opposite signs. Note that this occurs precisely 
when two polys are (or will be) in opposite sides of the same pot. 

2.1 The Linear Arithmetic Algorithm 

We start with a (possibly empty) pot-list and a list of polys to be added to it. 
We repeat the following until we reach a fixed point. 

1. For each poly to be added: 

find its pot (the one whose label matches the poly’s largest un- 
known), if there is one, or make a new one. Add the poly to this 
pot and cancel the new poly with any polys of the opposite sign. 

If this generates a contradiction, quit and return the contradiction; 
otherwise set any new polys aside. 

2. If there were any new polys set aside in step 1, go back to step 1 with the 
new polys. Otherwise, go on to step 3. 

3. For each pot that is new or has been changed by having polys added to it in 
step 1, try to create a poly from type-set about the label of the changed 
pot. Collect any such newly created polys and return to step 1 with them. 

2.2 An Example 

Suppose that we want to prove

$$\text{integer } a \;\wedge\; \text{integer } b \;\wedge\; 0 \le a \;\wedge\; a < b \;\Longrightarrow\; a + 1 \le a \cdot b + b.$$

As before, we assume the hypotheses and the negation of the conclusion,
and look for a contradiction. Since a and b are assumed to be integers, the
linearization of $a < b$ is $0 \le b + -1 \cdot a + -1$. Similarly,
$a \cdot b$ is known to be an integer since a and b are, so the linearization
of the negation of $a + 1 \le a \cdot b + b$ is
$0 \le -1 \cdot a \cdot b + -1 \cdot b + a$. Both of these linearizations used
the 1+ trick. Finally, the linearization of $0 \le a$ is just $0 \le a$.

Since no cancellations can be performed between these three polys, executing
steps 1 and 2 above results in the pot-list

label      positives                        negatives

a          $0 \le a$

b          $0 \le b + -1 \cdot a + -1$

a · b                                       $0 \le -1 \cdot a \cdot b + -1 \cdot b + a$.

In step 3 we create three polys from type-set. There are three new (and
therefore changed) pots, $a$, $b$, and $a \cdot b$. Type-set knows that a is a
nonnegative integer and that b is a positive integer⁹ and so we create the
polys $0 \le a$ and $0 < b$. As mentioned in Section 1.3, type-set therefore
also knows that $a \cdot b$ is a nonnegative integer, and so we create the poly
$0 \le a \cdot b$. This small amount of nonlinear reasoning has long been built
into ACL2. Note that we used the pot labels to guide our search for additional
polys.

When we add these to the pot-list, after executing step 1 once, we get

label      positives                        negatives

a          $0 \le a$

b          $0 < b$
           $0 \le b + -1 \cdot a + -1$

a · b      $0 \le a \cdot b$                $0 \le -1 \cdot a \cdot b + -1 \cdot b + a$

with the poly $0 \le -1 \cdot b + a$ having been set aside. This poly is the
result of cancelling the two polys in the $a \cdot b$ pot.¹⁰ Upon adding it and
canceling the polys in the b pot (executing steps 2 and 1 again), we get the
contradiction $0 \le -1$ and our lemma is proved.
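
The algorithm of Section 2.1 and the example above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not ACL2's implementation: polys are triples `(terms, const, strict)`, the ordering on unknowns is supplied by the caller as a hypothetical `rank` dictionary, and type reasoning is stood in for by a caller-supplied function `polys_from_type_set`.

```python
from fractions import Fraction

def normalize(coeffs, const, strict, rank):
    """Drop zero terms; scale so the largest unknown has coefficient +1 or -1."""
    coeffs = {u: Fraction(c) for u, c in coeffs.items() if c != 0}
    const = Fraction(const)
    if coeffs:
        lead = max(coeffs, key=rank.__getitem__)
        scale = abs(coeffs[lead])
        coeffs = {u: c / scale for u, c in coeffs.items()}
        const = const / scale
    return (tuple(sorted(coeffs.items())), const, strict)

def label_of(poly, rank):
    return max((u for u, _ in poly[0]), key=rank.__getitem__)

def is_contradiction(poly):
    terms, const, strict = poly
    return not terms and (const < 0 or (strict and const == 0))

def cancel(p, q, rank):
    """Add two normalized polys whose leading coefficients are +1 and -1."""
    coeffs = dict(p[0])
    for u, c in q[0]:
        coeffs[u] = coeffs.get(u, 0) + c
    return normalize(coeffs, p[1] + q[1], p[2] or q[2], rank)

def add_polys(pot_list, polys, rank, polys_from_type_set=lambda label: []):
    """Return ("contradiction", poly) or ("pot-list", pot_list)."""
    pending = list(polys)
    while pending:
        changed = set()
        while pending:                            # steps 1 and 2
            new = []
            for poly in pending:
                if is_contradiction(poly):
                    return ("contradiction", poly)
                if not poly[0]:
                    continue                      # trivially true, e.g. 0 < 3
                label = label_of(poly, rank)
                pot = pot_list.setdefault(label, {"pos": set(), "neg": set()})
                positive = dict(poly[0])[label] > 0
                side, other = ("pos", "neg") if positive else ("neg", "pos")
                if poly in pot[side]:
                    continue                      # nothing new to add
                pot[side].add(poly)
                changed.add(label)
                for q in pot[other]:              # cancel with the opposite side
                    r = cancel(poly, q, rank)
                    if is_contradiction(r):
                        return ("contradiction", r)
                    new.append(r)                 # set the new poly aside
            pending = new
        for label in changed:                     # step 3
            pending.extend(polys_from_type_set(label))
    return ("pot-list", pot_list)

# The example of Section 2.2, with the assumed ordering a < b < a*b.
rank = {"a": 0, "b": 1, "a*b": 2}
def poly(coeffs, const, strict):
    return normalize(coeffs, const, strict, rank)

initial = [poly({"a": 1}, 0, False),
           poly({"b": 1, "a": -1}, -1, False),
           poly({"a*b": -1, "b": -1, "a": 1}, 0, False)]
type_set = {"a": [poly({"a": 1}, 0, False)],
            "b": [poly({"b": 1}, 0, True)],
            "a*b": [poly({"a*b": 1}, 0, False)]}
result = add_polys({}, initial, rank, lambda label: type_set.get(label, []))
print(result)
```

Polys are kept in sets so that re-deriving an existing poly does not count as a change, which is what lets the loop reach a fixed point in this sketch. As in the text, cancellation only ever adds polys to the pot-list.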

3 Linear Lemmas 

Prior to version 2.7, ACL2’s arithmetic procedure encompassed little more than 
is described in this section. It is still the standard behavior of ACL2 when non- 
linear arithmetic is disabled. Note that this procedure is not complete. 

Suppose that the procedure described above does not produce a contradiction
but instead yields a set of nontrivial polys. A contradiction might still be
generated if we could add to the set some additional polys which allow further
cancellation. That is where linear lemmas come in. Linear lemmas are more
general and powerful than polys from type-set. (An example follows shortly.)
When the set of polys has stabilized under the procedure described above and no
contradiction has been produced, we form a list of the labels of any newly
created pots and search the database of linear rules for ones that pattern
match with a pot label. For each rule found, if we are able to relieve its
hypotheses, we add its conclusion to the pot-list (using the above linear
arithmetic algorithm) in the hope that this will allow further cancellations to
proceed. Just as for polys from type-set, we are using pot labels to guide our
search for additional polys. Such labels, recall, correspond to the unknowns
that are candidates for cancellation.

⁹ The variable a is nonnegative by hypothesis, and since b is strictly greater
than a, b must be positive. This is about as complicated as type-reasoning gets.

¹⁰ Note that cancellation does not remove any polys. We augment, but never
diminish, the pot-list.

3.1 The Linear Lemmas Algorithm 

As before, we start with a list of polys and a (possibly empty) pot list. We repeat 
the following until we reach a fixed point or are interrupted by the user aborting 
the proof attempt. 

1. Add the polys with the linear arithmetic algorithm as described in section 
2.1; if no contradiction was generated, go on to step 2. 

2. Make a list of the labels from any new pots created in step 1 (or 3). If there 
aren’t any, quit and return the pot-list; otherwise, go on to step 3. 

3. For each item in this list and for each applicable linear lemma:

If we can relieve the lemma’s hypotheses, add the concluding poly(s) 
to the pot-list as described in section 2.1. 

4. If a contradiction was generated, quit and return it. Otherwise, go back to 
step 2. 

3.2 An Example 

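Suppose that we are given the following linear lemma,¹¹ expt-lemma, about $x^n$:

$$1 < x \;\wedge\; \text{integer } n \;\wedge\; 1 < n \;\Longrightarrow\; x < x^n,$$

and that we wish to prove

$$2 < x \;\wedge\; \text{integer } n \;\wedge\; 1 < n \;\wedge\; a < x + b \;\Longrightarrow\; a < x^n + b.$$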

After linearizing the inequalities among the hypotheses and the negation of 
the conclusion and adding them to the empty pot-list (step 1) we get 



label      positives                     negatives

b                                        $0 < -1 \cdot b + a$

n          $1 < n$

x          $0 < x + b + -1 \cdot a$
           $0 < x + -2$

x^n        $0 < x^n$                     $0 \le -1 \cdot x^n + -1 \cdot b + a$

Note that the poly $0 < x^n$ was created from type-set about the pot-label $x^n$.

¹¹ Note that the conclusion of this lemma does not encode a type such as
positive integer, and so could not be expressed as a type-prescription rule. We
also mention here that the hypotheses of a linear lemma may be relieved by
general purpose rewriting and (recursively) linear and nonlinear arithmetic,
while a type-prescription rule’s hypotheses must be relieved by type-reasoning
only.
In step 2, we note that there were four new pots created in step 1, and in
step 3 we will eventually find expt-lemma. We sketch here how we relieve the
first hypothesis, $1 < x$. Rewriting cannot do anything with this, so we
linearize the negation of the hypothesis (yielding $0 \le -1 \cdot x + 1$) and
recursively call the very procedure we are describing. In this situation we do
not start with an empty pot-list. This poly will be added to the x pot. Upon
cancellation with $0 < x + -2$, we get the contradiction $0 < -1$, and the
hypothesis has been relieved.

We therefore add the linearization of the conclusion of expt-lemma,
$0 < x^n + -1 \cdot x$, to the pot-list. After a couple of rounds of
cancellation we derive the contradiction $0 < -2$, and the theorem has been
proved.

4 Linear Lemmas Revised 

When nonlinear arithmetic is enabled, we do the above procedure a little differ- 
ently. The gathering of polys from linear lemmas is intended to let the process 
of cancellation continue. In the procedure described in this section we still use 
linear lemmas, but we intertwine their use with other ways of gathering polys in 
preparation for what is to come - the nonlinear arithmetic procedure. 

4.1 Exploded Pot Labels, Bounds Polys, and Inverse Polys 

Previously, we examined pot labels to direct our gathering of additional polys 
from such sources as type-set and linear lemmas. That is, when there was a 
pot labeled with, say, $x^n$, we looked to type-set or linear lemmas for
additional information about $x^n$. We shall soon examine “exploded” pot
labels. These exploded pot labels consist of the original pot label and, if the
pot label is a product, each of the label’s factors. A few examples will make
this clearer:

— $x \Rightarrow x$

— $\lfloor x \rfloor \Rightarrow \lfloor x \rfloor$

— $x \cdot \lfloor x \rfloor \Rightarrow x$, $\lfloor x \rfloor$, and $x \cdot \lfloor x \rfloor$

— $x \cdot y \cdot z \Rightarrow x$, $y$, $z$, and $x \cdot y \cdot z$

We are doing this so that we can seed the database with information about
the factors of products. Note that in the last example we do not examine, for
instance, $x \cdot y$.
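
A minimal Python sketch of label explosion follows. The representation is an assumption made for illustration: a label is a tuple of factor names, and a one-element tuple stands for a label that is not a product. Neither function name comes from ACL2.

```python
def explode(label):
    """Return the exploded form of one pot label.

    A product label yields each of its factors followed by the product
    itself; partial products such as x*y inside x*y*z are not generated.
    """
    if len(label) == 1:
        return [label]
    return [(factor,) for factor in label] + [label]

def exploded_labels(labels):
    """Explode a list of labels, dropping duplicates but keeping order."""
    seen, out = set(), []
    for label in labels:
        for item in explode(label):
            if item not in seen:
                seen.add(item)
                out.append(item)
    return out

print(explode(("x", "y", "z")))
print(exploded_labels([("a",), ("b",), ("b", "1/a")]))
```

The second call shows how a factor such as `1/a` enters the list even though no pot is labeled by it.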

A “bounds poly” is a poly with exactly one unknown and can be considered
to bound the unknown. For instance, $0 < x + 1$ can be considered to give a
lower bound of $-1$ for $x$. Similarly, $0 < -1 \cdot x + 3$ bounds $x$ from
above at 3. A term is said to have “good bounds” if there are bounds polys for
that term which bound the term away from zero. This will become important later
when we multiply certain polys. For example, we may wish to multiply
$0 < x \cdot \frac{1}{y}$ and $0 < y$ to form the new poly $0 < x$. But, since
this requires rewriting $y \cdot \frac{1}{y}$ to 1, it can be done only if $y$
is known to be non-zero.

Division thus introduces additional issues. We represent the ratio $x/y$ as $x \cdot \frac{1}{y}$.
A term is said to “involve division” if it is of the form 
1. $\frac{1}{x}$ or $(\frac{1}{x})^n$, or

2. $x^c$ or $x^{c \cdot n}$, where $c$ is a constant negative integer.

As preparation for the nonlinear procedure, given such a term about division, 
ACL2 “adds its inverse polys.” We do not attempt to describe the method for 
generating these polys here other than to say that we gather our initial informa- 
tion from the bounds polys present in the pot-list. We give a few examples: 

— If we can determine that 4 < x, we know both 0 < 1/x and 1/x < 1/4.

— If we can determine that 0 < 1/x and 1/x < 3, we know 1/3 < x.

— If we can only determine that -2 < x, we do not know anything about 1/x.
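The three examples above follow from elementary reasoning about reciprocals, which the following Python sketch makes explicit. It is not ACL2's method for generating inverse polys (which the text does not describe); the function name `inverse_bounds` and the convention that `None` means "no bound known" are assumptions made for this illustration.

```python
from fractions import Fraction


def inverse_bounds(lower=None, upper=None):
    """Given strict bounds lower < t < upper, return strict bounds on 1/t.

    The result is a pair (inv_lower, inv_upper); None means nothing is known.
    Bounds are only produced when t is known to be bounded away from a sign
    change, i.e. t is known positive or known negative.
    """
    if lower is not None and lower >= 0:  # t is known to be positive
        inv_lower = Fraction(1) / upper if upper is not None else Fraction(0)
        inv_upper = Fraction(1) / lower if lower > 0 else None
        return inv_lower, inv_upper
    if upper is not None and upper <= 0:  # t is known to be negative
        inv_upper = Fraction(1) / lower if lower is not None else Fraction(0)
        inv_lower = Fraction(1) / upper if upper < 0 else None
        return inv_lower, inv_upper
    return None, None  # sign of t unknown, so nothing can be said about 1/t


print(inverse_bounds(lower=4))            # bounds on 1/x from 4 < x
print(inverse_bounds(lower=0, upper=3))   # bounds on x from 0 < 1/x < 3
print(inverse_bounds(lower=-2))           # -2 < x alone gives nothing
```

In the second call the roles are swapped: the term t is 1/x, so the returned bounds are bounds on x.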



4.2 The Revised Linear Lemmas Algorithm 

As mentioned above, when nonlinear arithmetic is enabled we do things a bit 
differently. As before, we start with a list of polys and a (possibly empty) pot 
list. We repeat the following until we reach a fixed point or are interrupted by 
the user aborting the proof attempt. 

1. Add the polys with the linear arithmetic algorithm as described in section 
2.1; if no contradiction was generated, go on to step 2. 

2. Make an exploded list of the labels of any new pots. If there are not any, 
quit and return the new pot-list. Otherwise, go on to step 3. 

3. For each item in this list: 

a) Add any polys created from type-set. 

b) For each applicable linear lemma: If we can relieve its hypotheses, add 
the concluding poly(s) to the pot-list as in section 2.1. 

c) If the item involves division, add any inverse polys.

4. If a contradiction was generated, quit and return it. Otherwise, go back to 
step 2. 

4.3 An Example Part I 

Let us consider 0 < a ∧ a < b ⇒ 1 < b/a. After adding the initial polys,
the pot-list will look like

label        positives          negatives

a            0 < a

b            0 < b + -1 · a

b · (1/a)                       0 < -1 · b · (1/a) + 1

In step 2, we make the list a, b, 1/a, and b · (1/a). Note the presence of 1/a, which would
not have been there if we used regular pot labels. In step 3a we create the poly
0 < 1/a, among others, from type-set and add it to the pot. (We also create the
same poly in step 3c.) We will continue this example in Section 5.1.

5 Nonlinear Arithmetic

Before proceeding, let us pause a moment to recollect where we are. In Sec- 
tion 2.1, we presented the linear arithmetic algorithm which lies at the heart of 
ACL2’s arithmetic procedures - both linear and nonlinear. We next described 
the previously existing linear lemmas algorithm in Section 3.1. This algorithm 
uses the linear arithmetic algorithm and is still the default behaviour when non- 
linear arithmetic is not enabled. Next, in Section 4.2 we described a variant of 
the linear lemmas algorithm which is used when nonlinear arithmetic is enabled. 
Whereas, when nonlinear arithmetic is disabled, the previously existing linear 
lemmas algorithm is the outermost loop for arithmetic reasoning; the new vari- 
ant is only the middle loop when nonlinear arithmetic is enabled. We are now 
about to describe the outermost loop of the nonlinear arithmetic algorithm. 

The nonlinear arithmetic procedure consists of three subprocedures: deal- 
with-product, deal-with-factor, and deal-with-division. Each of these subproce- 
dures is guided by pot-labels and attempts to multiply polys. In order to multiply 
two polys, we unlinearize the polys (converting them back into ACL2 terms), cre- 
ate the term representing their product, use general-purpose rewriting to rewrite 
the product terms, and linearize the result. For example, the product of the two 
polys 0 < -1 · x + 3 and 0 < y + a is 0 < -1 · y · x + -1 · x · a + 3 · y + 3 · a.
In order to multiply two pots, form a list of the polys in each pot and multiply 
each poly in the first list with each poly in the second. We multiply more than 
two polys or pots by generalizing the above. 
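The multiplication step can be illustrated with a small Python sketch. It is a simplification under stated assumptions: a poly 0 < p is modeled as a dictionary from monomials (sorted tuples of variable names, with the empty tuple for the constant) to integer coefficients, and the name `multiply_polys` is ours. It performs plain polynomial expansion only and does not model the unlinearize, rewrite, and linearize round trip through ACL2 terms, so rewrites such as a · (1/a) to 1 are not performed.

```python
from collections import defaultdict


def multiply_polys(p, q):
    """Multiply two polys of the form 0 < p and 0 < q.

    Each poly maps a monomial (sorted tuple of variable names) to its
    coefficient. Since both p and q are positive, so is their product,
    and the result again represents a poly of the form 0 < p*q.
    """
    product = defaultdict(int)
    for mono_p, coeff_p in p.items():
        for mono_q, coeff_q in q.items():
            mono = tuple(sorted(mono_p + mono_q))
            product[mono] += coeff_p * coeff_q
    # Drop monomials whose coefficients cancelled to zero.
    return {m: c for m, c in product.items() if c != 0}


# 0 < -1*x + 3  and  0 < y + a
p = {("x",): -1, (): 3}
q = {("y",): 1, ("a",): 1}
print(multiply_polys(p, q))
```

The printed result has the four monomials of the product given in the text: -1 · y · x, -1 · x · a, 3 · y, and 3 · a.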



5.1 Deal-with-Product and Deal-with-Factor 

When we have polys about a product and we have polys about the product’s 
factors, we can multiply those polys about the factors to form polys about the 
product and perhaps thereby allow cancellation to proceed. 

For instance, if we have a new pot about the product a · b · c, we can form new
polys about the product by finding pots with any of the following combinations
of labels and then multiplying the pots.

— a, b, and c

— a, and b · c

— a · c, and b

— a · b, and c

This is done by the subprocedure deal-with-product.

Similarly, if the new pot is about a, we look for pots of which a is a factor,
such as a · b · c, and then see if we can complete the product. This is done by the
subprocedure deal-with-factor. We use these two procedures in tandem so that
we are less sensitive to the order in which pots are created.
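The four label combinations listed for a · b · c are exactly the ways of splitting the factors into two or more non-empty groups. The following Python sketch enumerates such splits; the name `factor_splits` is hypothetical, and how ACL2 actually searches the pot-list for matching labels is not described in the text.

```python
def factor_splits(factors):
    """Yield every way to split the factors into two or more non-empty groups.

    Each split is a list of groups, and each group stands for the label of a
    pot that deal-with-product would look for (e.g. ["a"] and ["b", "c"]).
    """
    def partitions(items):
        if not items:
            yield []
            return
        first, rest = items[0], items[1:]
        for smaller in partitions(rest):
            # Put the first factor into one of the existing groups ...
            for i, group in enumerate(smaller):
                yield smaller[:i] + [[first] + group] + smaller[i + 1:]
            # ... or give it a group of its own.
            yield [[first]] + smaller

    for split in partitions(list(factors)):
        if len(split) >= 2:  # the undivided product itself is not a split
            yield split


for split in factor_splits(["a", "b", "c"]):
    print(split)
```

For three factors this yields the four combinations listed above; for larger products the number of splits grows quickly, which is one reason the procedure is driven by the pots that actually exist.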

Let us revisit the example from Section 4.3. When we left it, having just added the
polys from type-set, it looked like

label        positives          negatives

a            0 < a

1/a          0 < 1/a

b            0 < b
             0 < b + -1 · a

b · (1/a)    0 < b · (1/a)      0 < -1 · b · (1/a) + 1

Both deal-with-product and deal-with-factor take a pot-label to consider and
a pot-list. Deal-with-product will be used only with products, such as b · (1/a) above,
while deal-with-factor will be used only with individual factors, such as a, b, and
1/a above.

When deal-with-product is given b · (1/a), it will find the pots for b and 1/a
and multiply the two pots. In particular, it will multiply the polys 0 < 1/a and
0 < b + -1 · a, getting 0 < b · (1/a) + -1 · a · (1/a). This will be rewritten to

0 < b · (1/a) + -1

since a is known to be non-zero. Upon adding this to the pot-list we would get
the contradiction 0 < 0 and be done.

When deal-with-factor is given a, it will do nothing because a is not a factor
of any pot-labels. However, when it is given b, it will find the product b · (1/a). The
pot-label 1/a will complete this product, and so deal-with-factor will multiply the
pots for b and 1/a with the same results as for deal-with-product. The pot label
1/a is dealt with similarly.

5.2 Deal-with-Division 

Let us next consider 0 < b ∧ b < a ⇒ 1 < a/b. After executing the
revised linear lemmas algorithm, the pot-list will look like

label        positives          negatives

a            0 < a

b            0 < b              0 < -1 · b + a

1/b          0 < 1/b

a · (1/b)    0 < a · (1/b)      0 < -1 · a · (1/b) + 1

This time, deal-with-product and deal-with-factor are insufficient. If we multiply
0 < a and 0 < 1/b we get 0 < a · (1/b), which we already knew via polys from
type-set. Rather, we want to multiply the polys 0 < b and 0 < -1 · a · (1/b) + 1.
After rewriting a · b · (1/b) to a, we have 0 < b + -1 · a. Upon adding this latter poly
to the pot-list, we get 0 < 0 and the lemma is proved.

We now sketch the algorithm behind deal-with-division. 

1. If the current pot label being considered is itself a product, quit. If we have 
good bounds for the label, go to step 2; if not, quit. 

2. Make a list of all the pot labels that have the multiplicative inverse of the 
current label as a factor. To distinguish them from the current pot, we will 
refer to the pots these labels belong to as the “found” pots. For each entry 
in this list: 
a) Multiply the bounds polys from the current pot and the polys in the
found pot.

b) Multiply the bounds polys from the found pot and the bounds polys 
from the current pot. 

Let us see how this lines up with what we said we wanted to do above. When
deal-with-division is examining the pot-label b (which has good-bounds) it finds
the pot-labels 1/b and a · (1/b). For the second of these, since 0 < b and
0 < -1 · a · (1/b) + 1 are both bounds polys, we multiply these polys in step 2. As above,
upon adding the resulting poly to the pot-list, we get 0 < 0 and the lemma is proved.



5.3 The Nonlinear Arithmetic Algorithm 

After adding polys as in Section 2.1, loop through the following at most three 
times. If at any point we generate a contradiction, quit and return it. 

1. Execute the revised linear lemmas algorithm, described in Section 4.2. 

2. Make a list (not an exploded list) of the labels from any new pots and for 
each item in that list: 

a) If we have good-bounds for the current item, carry out deal-with-division 
and add any polys generated. 

b) If the current item is a product, carry out deal-with-product and add 
any polys generated. 

c) If the current item is not a product, carry out deal-with-factor and add 
any polys generated. 

This concludes our presentation of ACL2’s nonlinear arithmetic algorithm. 

6 Conclusion 

The nonlinear arithmetic procedure is tightly integrated with the rest of ACL2 
and allows lemmas such as the following to be proven automatically. 

• This lemma was needed for an industrial project to verify the correctness of 
a microprocessor. It inspired our original work on nonlinear arithmetic and 
was an early success. 

e < a ∧ a < d ∧ i < h ∧ h < g ∧ g < f
∧ a · f - a · h < b + c · g - c · h
⇒ e · f - e · i < b + c · g - c · h + d · h - d · i

• Proving this equality helped us to refine deal-with-division. 
Where bc is the binomial coefficient defined by Pascal’s Triangle:

(bc i j) =

if i < 0 or i < j then return 0
elseif j <= 0 then return 1
else return (bc (i - 1) j) + (bc (i - 1) (j - 1))

• This was a long-standing challenge problem given to us by our sponsors. 
Previous versions of this proof required a dozen or more helper lemmas to 
be proven; we can now do the proof with only the one helper lemma given 
below. 

Consider the following 6502 assembly program to multiply two 8-bit num- 
bers: 





        ; Multiply F1 and F2, leaving 16 bit result in A and LOW
        LDX #8       ; Load X immediate with the integer 8
        LDA #0       ; Load A immediate with the integer 0
LOOP    ROR F1       ; Rotate F1 right circular through C
        BCC ZCOEF    ; Branch to ZCOEF if C = 0
        CLC          ; Set C to 0
        ADC F2       ; Set A to A+F2+C and C to the carry
ZCOEF   ROR A        ; Rotate A right circular through C
        ROR LOW      ; Rotate LOW right circular through C
        DEX          ; Set X to X-1
        BNE LOOP     ; Branch to LOOP if Z = 0
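To see why the program multiplies, it can help to step through it in a small simulation. The Python sketch below mirrors the listing instruction by instruction for the 8-bit case; the function names `ror` and `multiply_6502` are ours, and the simulation is a testing aid, not the ACL2 model used in the proof.

```python
def ror(value, carry):
    """Rotate an 8-bit value right through the carry flag.

    Returns the new value and the new carry (the bit rotated out).
    """
    return (carry << 7) | (value >> 1), value & 1


def multiply_6502(f1, f2, low=0, carry=0):
    """Simulate the multiply loop; return (A, LOW), the high and low bytes.

    The initial contents of LOW and of the carry flag are parameters to show
    that the result does not depend on them.
    """
    x = 8                                  # LDX #8
    a = 0                                  # LDA #0
    while True:
        f1, carry = ror(f1, carry)         # LOOP  ROR F1
        if carry:                          # BCC ZCOEF (skip the add if C = 0)
            carry = 0                      # CLC
            total = a + f2 + carry         # ADC F2
            a, carry = total & 0xFF, total >> 8
        a, carry = ror(a, carry)           # ZCOEF ROR A
        low, carry = ror(low, carry)       # ROR LOW
        x = (x - 1) & 0xFF                 # DEX
        if x == 0:                         # BNE LOOP falls through when Z = 1
            break
    return a, low


a, low = multiply_6502(13, 11)
print((a << 8) | low)
```

Each pass examines one bit of F1, conditionally adds F2 into A, and then shifts the 16-bit pair A, LOW right by one, so after eight passes the pair holds the full product.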



The next lemma was the only one we needed to prove that the above code, 
generalized to an i-bit wide register, was correct. 

• We can also prove this final example automatically. It states that rotating 
right an i-bit wide register through a carry flag fits back into the i-bit wide 
register. 

x < 2^i ∧ integer i ∧ integer x ∧ (c = 0 ∨ c = 1)
⇒ floor(x/2) + c · 2^(i-1) < 2^i
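A bounded brute-force check is no substitute for the ACL2 proof, but it is a quick sanity test of the statement as read here. The sketch below uses exact rational arithmetic so that non-positive values of i are handled as well; the function name `counterexamples` and the search ranges are arbitrary choices of ours.

```python
from fractions import Fraction


def counterexamples(min_i=-3, max_i=8, min_x=-16):
    """Search a small range for violations of the rotate-right lemma.

    For integers x and i with x < 2^i and c in {0, 1}, the lemma claims
    floor(x/2) + c * 2^(i-1) < 2^i.
    """
    two = Fraction(2)
    found = []
    for i in range(min_i, max_i + 1):
        bound = two ** i
        for x in range(min_x, 2 ** max(i, 0) + 1):
            if not x < bound:
                continue  # hypothesis x < 2^i does not hold
            for c in (0, 1):
                # x // 2 is floor division, matching floor(x/2) for integers.
                if not (x // 2) + c * two ** (i - 1) < bound:
                    found.append((x, i, c))
    return found


print(counterexamples())
```

An empty list means that no violation exists in the searched range.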



Our nonlinear arithmetic extension to ACL2 provides significant benefits at a 
small cost. Proofs that do not involve any nonlinear inequalities are not affected 
and run at the same speed. A typical “small” lemma with a couple of nonlinear 
inequalities, which ACL2 could prove automatically before, will generally be 
proven within a few percentage points of the time previously required - but we 
can now automatically prove more of these. For more complicated lemmas and 
theorems, little can be said about the computer time required. 

However, within broad limits, the time a user takes to complete a proof is of 
greater importance than the time the computer takes. Examining failed proofs 
and writing helper lemmas can be time-consuming and psychologically draining. 
The fewer lemmas a user has to prove on the way to a desired result, the better. 
Abstract. SAT-based Bounded Model Checking (BMC), though a robust and
scalable verification approach, is still computationally intensive, requiring large
memory and time. Interestingly, with the recent development of improved SAT
solvers, it is frequently the memory limitation of a single server rather than time
that becomes a bottleneck for doing deeper BMC search. Distributing the
computing requirements of BMC over a network of workstations can overcome
the memory limitation of a single server, albeit at increased communication
cost. In this paper, we present: a) a method for distributed-SAT over a network
of workstations using a Master/Client model where each Client workstation has
an exclusive partition of the SAT problem and uses knowledge of the partition
topology to communicate with other Clients, b) a method for distributing SAT-
based BMC using the distributed-SAT. For the sake of scalability, at no point
in the BMC computation does a single workstation have all the information. We
experimented on a network of heterogeneous workstations interconnected with a
standard Ethernet LAN. To illustrate, on an industrial design with ~13K FFs
and ~0.5M gates, the non-distributed BMC on a single workstation (with 4 GB
memory) ran out of memory after reaching a depth of 120; on the other hand,
our SAT-based distributed BMC over 5 similar workstations was able to go
up to 323 steps with a communication overhead of only 30%.



1 Introduction 

With increasing design complexity of digital hardware, functional verification has 
become the most expensive and time-consuming component of the product 
development cycle [1]. Verifying modern designs requires robust and scalable 
approaches in order to meet more-demanding time-to-market requirements. Formal 
verification techniques like symbolic model checking [2, 3], based on the use of 
Binary Decision Diagrams (BDDs) [4], offer the potential of exhaustive coverage and 
the ability to detect subtle bugs in comparison to traditional techniques like 
simulation. However, these techniques do not scale well in practice due to the state 
explosion problem. SAT solvers enjoy several properties that make them attractive as 
a complement to BDDs. Their performance is less sensitive to the problem sizes and 
they do not suffer from space explosion. As a result, various researchers have 
developed routines for performing Bounded Model Checking (BMC) using SAT [5- 
8]. Unlike symbolic model checking, BMC focuses on finding bugs of a bounded 
length, and successively increases this bound to search for longer traces. Given a
design and a correctness property, it generates a Boolean formula, such that the 
formula is true if and only if there exists a witness/counterexample of length k. This 
formula is then checked by a backend SAT solver. Due to the many recent advances 
in SAT solvers [9-13], SAT-based BMC can handle much larger designs and analyze 
them faster than before. 

The main limitation of current applications of BMC is that it can do search up to a 
maximum depth allowed by the physical memory on a single server. This limitation 
comes from the fact that as the search bound k becomes larger, the memory 
requirement due to unrolling of the design also increases. Especially for the memory- 
bound designs, a single server with a limited memory has now become the bottleneck 
to doing deeper search. 



1.1 Motivation 

Distributing computing requirements of BMC (memory and time) over a network of
workstations can, however, overcome the memory limitation of a single server. In this
paper, we explore this possibility and discuss in detail the approaches that
made it feasible. Before we delve into that, we would like to give the intuition behind
the feasible solution.

A BMC problem (described in Section 2) originating from an unrolling of the 
sequential circuit in different time frames provides a natural disjoint partitioning of 
the problem and thereby, allows the computing resources to be configured in a linear 
topology. The topology using one Master and several Clients is shown in Figure 1. 




Fig. 1. Partitioning of Unrolled Circuit 



Each Client C_i hosts a part of the unrolled circuit, i.e., from n_i + 1 to n_{i+1}, where n_i
represents the partition depth. Each C_i (except for the terminals) is connected to C_{i+1}
and C_{i-1}. The Master is connected to each of the Clients. Using the linear topology, we
can distribute parts of the unrolled circuit dynamically over additional Clients as and
when memory resources on current Clients get close to exhaustion.

To check the satisfiability of a Boolean problem originating from BMC wherein 
the unrolled circuit is distributed over several servers, we must identify the part of the 
SAT algorithm that may be delegated to each processor without requiring any 
processor to have the entire problem data. Since Boolean Constraint Propagation 
(BCP) on clauses can be done independently on an exclusive partition, it can be 
delegated to each processor. Moreover, since about 80% of SAT time involves BCP, 
one could achieve some level of parallelism by doing distributed-BCP. Note that any
approach similar to SAT-based BMC can use a similar concept to exploit parallelism.

With this motivation we now briefly describe the organization of the rest of the
paper. After a brief discussion of prior related work in Section 1.2, we give a short
background in Section 2, our contributions in Sections 3-7, experiments in Section 8,
and conclusions in Section 9.



1.2 Related Work 

Parallel SAT solvers have been proposed by many researchers [14-19]. Most of
them target performance improvement of the SAT solver. These algorithms are based 
on partitioning the search space on different processors using partial assignments on 
the variables. Each processor works on the assigned space and communicates with 
other processors only after it is done searching its portion of the search space. Such 
algorithms are not scalable memory-wise due to high data redundancy as each 
processor keeps the entire problem data (all clauses and variables). 

In a closely related work on parallelizing SAT [16], the authors partition the 
problem by distributing the clauses evenly on many application specific processors. 
They use fine grain parallelism in the SAT algorithm to get better load balancing and 
reduce communication costs. Though they have targeted the scalability issue by 
partitioning the clauses disjointedly, the variables appearing in the clauses are not 
disjoint. Therefore, whenever a Client finishes BCP on its set of clauses, it must 
broadcast the newly implied variables to all the other processors. The authors 
observed that over 90% of messages are broadcast messages. Broadcasting 
implications can become a serious communication bottleneck when the problem 
contains millions of variables. 

Reducing the space requirement in model checking has been suggested in several 
works [20-22]. These studies suggest partitioning the problem in several ways. The 
work in [20] shows how to parallelize the model checker based on explicit state 
enumeration. They achieve it by partitioning the state table for reached states into 
several processing nodes. The work in [21] discusses techniques to parallelize the 
BDD-based reachability analysis. The state space on which reachability is performed 
is partitioned into disjoint slices, where each slice is owned by one process. The 
process performs a reachability algorithm on its own slice. In [22], a single computer 
is used to handle one task at a time, while the other tasks are kept in external memory. 
In another paper [23], the author suggested a possibility of distributing SAT-based 
BMC but has not explored the feasibility of such an approach. 



2 Background 



State-of-the-Art SAT Solver 

The Boolean Satisfiability (SAT) problem consists of determining a satisfying 
assignment for a Boolean formula on the constituent Boolean variables or proving 
that no such assignment exists. The problem is known to be NP-complete. Most SAT 
solvers [9-13] employ DPLL style [24] algorithm as shown in Figure 2 with three 
main engines: decision, deduction, and diagnosis. A Boolean problem can be 
expressed either in CNF form or logical gate form or both. A hybrid SAT solver as in
[12], where the problem is represented as both logical gates and a CNF expression, is
well suited for BMC.



SAT_Solve(P=1) {                     // Check if constraint P=1 satisfiable?
  while (Decide() == SUCCESS)        // Selects a new variable
    while (Deduce() == CONFLICT)     // BCP till conflict/no-conflict
      if (Diagnose() == FAILURE)     // Add conflict learnt clause(s)
        return UNSAT;                // Conflict found at decision level 0
  return SAT; }                      // No more decision to make



Fig. 2. DPLL style SAT Solver 
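For readers who want to experiment with the decide and deduce steps, here is a minimal recursive DPLL sketch in Python. It is deliberately simpler than the solvers cited above: it has unit propagation (BCP) and chronological backtracking only, with no conflict analysis or learnt clauses, so the diagnosis engine of Figure 2 is not modeled. Clauses are lists of non-zero integers in the usual DIMACS style, and the function names are ours.

```python
def unit_propagate(clauses, assignment):
    """Deduce: run BCP to a fixed point. Return None if a clause is falsified."""
    assignment = dict(assignment)
    changed = True
    while changed:
        changed = False
        for clause in clauses:
            unassigned = []
            satisfied = False
            for lit in clause:
                var, want = abs(lit), lit > 0
                if var not in assignment:
                    unassigned.append(lit)
                elif assignment[var] == want:
                    satisfied = True
                    break
            if satisfied:
                continue
            if not unassigned:
                return None                      # conflict
            if len(unassigned) == 1:             # unit clause: forced value
                lit = unassigned[0]
                assignment[abs(lit)] = lit > 0
                changed = True
    return assignment


def dpll(clauses, assignment=None):
    """Return a satisfying assignment {var: bool}, or None if unsatisfiable."""
    assignment = unit_propagate(clauses, assignment or {})
    if assignment is None:
        return None
    variables = {abs(lit) for clause in clauses for lit in clause}
    free = sorted(variables - set(assignment))
    if not free:
        return assignment                        # no more decisions to make
    var = free[0]                                # Decide: pick a free variable
    for value in (True, False):
        result = dpll(clauses, {**assignment, var: value})
        if result is not None:
            return result
    return None


print(dpll([[1, 2], [-1, 2], [-2, 3]]))
print(dpll([[1], [-1]]))
```

The second call returns None because the two unit clauses contradict each other during propagation.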



Bounded Model Checking 

In BMC, the specification is expressed in LTL (Linear Temporal Logic). Given a
Kripke structure M, an LTL formula f, and a bound k, the translation task in BMC is
to construct a propositional formula [M, f]_k, such that the formula is satisfiable if and
only if there exists a witness of length k [25]. The satisfiability check is performed
by a backend SAT solver. Verification typically proceeds by looking for witnesses or
counter-examples (CE) of increasing length until a completeness threshold is reached [25, 26].
The overall algorithm of a SAT-based BMC for checking (or falsifying) a simple
safety property is shown in Figure 3. The SAT problems generated by the BMC
translation procedure grow bigger as k increases. Therefore, the practical efficiency of
the backend SAT solver becomes critical in enabling deeper searches to be performed.



BMC(k, P) {                          // Falsify safety property P within bound k
  for (int i=0; i<=k; i++) {
    Pi = Unroll(P, i);               // Get property node at i-th unrolled frame
    if (SAT_Solve(Pi=0) == SAT) return CE;   // Try to falsify
  }
  return NO_CE; }                    // No counter-example found



Fig. 3. SAT-based BMC for Safety Property P 
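The structure of the loop in Figure 3 can be tried out on a toy system with the Python sketch below. To keep it self-contained, the call to the SAT solver is replaced by explicit enumeration of all input sequences of the current depth, which is only feasible for tiny examples; the function name `bmc` and the state-machine interface (`init`, `step`, `bad`, `inputs`) are assumptions of this illustration, not the paper's implementation.

```python
from itertools import product


def bmc(init, step, bad, inputs, k):
    """Look for a counter-example to a safety property within bound k.

    init   : initial state
    step   : function (state, input) -> next state
    bad    : predicate that is True on states violating the property
    inputs : the possible input values at each time frame
    Returns the input sequence of a shortest violating trace, or None.
    """
    for depth in range(k + 1):
        # Stand-in for SAT_Solve on the depth-i unrolling: try every
        # input sequence of this length and test the final state.
        for seq in product(inputs, repeat=depth):
            state = init
            for inp in seq:
                state = step(state, inp)
            if bad(state):
                return list(seq)
    return None


def counter_step(state, enable):
    """A 3-bit counter that increments when the enable input is 1."""
    return (state + enable) % 8


print(bmc(0, counter_step, lambda s: s == 3, (0, 1), k=5))
print(bmc(0, counter_step, lambda s: s == 3, (0, 1), k=2))
```

With bound 5 the search finds the trace that enables the counter three times; with bound 2 no counter-example exists, which, as in BMC generally, says nothing about longer traces.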



3 Our Contributions 



Overview of Distributed-SAT 

Given an exclusive partitioning of the SAT problem, we give an overview of the fine 
grain parallelization of the three engines of the SAT algorithm (as described in 
Section 2) on a Master/Client distributed memory environment. The Master controls 
the execution of distributed-SAT. The decision engine is distributed in such a way 
that each Client selects a good local variable and the Master then chooses the globally 
best variable to branch on. During the deduction phase, each Client does BCP on its 
exclusive local partitions, and the Master does BCP on the global learned conflict 
clauses. Diagnosis is performed by the Master, and each Client performs a local 
backtrack when requested by the Master. The Master does not keep all problem clauses
and variables; however, the Master maintains the global assignment stack and the
global state for diagnosis. This requires much less memory than the entire problem
data. To ensure proper execution of the parallel algorithm, each Client is required to
be synchronized. We give details of the parallelization and different communication
messages in Section 5-9.



Novelties of Our Approach 

In this paper, we present a method for distributing SAT over a network of
workstations using a Master/Client model where each Client workstation has an
exclusive partition of the SAT problem. Though this work is closely related to [16],
there are some important differences: a) In [16], though each Client has a disjoint set of
clauses, variables are not disjoint. So, Clients, after completing BCP, broadcast their
new implications to all other Clients. After decoding the message, each receiving
Client either reads the message or ignores it. In a communication network where BCP
messages dominate, broadcasting implications can be overkill when the number of
variables runs into millions. In our improved distributed BCP, however, each Client
has the knowledge of the SAT-problem partition topology and uses that to
communicate with other Clients. This ensures that the receiving Client never has to
read a message that is not meant for it. b) The algorithm in [16] is developed
primarily for application specific processors, while our algorithm uses easily available 
existing networks of workstations. We have described several innovative optimization 
schemes to reduce the effect of communication overhead on performance in general- 
purpose networks by identifying and executing tasks in parallel while messages are in 
transit. 

In this paper, we also extend the SAT-based BMC (as a part of our formal 
verification platform called DiVer) using topology-cognizant distributed-SAT to 
obtain a SAT-based distributed BMC over a distributed-memory environment. For the 
sake of scalability, our method makes sure that at no point in the BMC computation 
does a single workstation have all the information. We developed our distributed 
algorithms for a network of processors based on standard Ethernet and using the 
TCP/IP protocol. We can also potentially use dedicated communication 
infrastructures that may yield better performance, but for this work, we wanted to use 
an environment that is easily available, and whose performance can be considered a 
lower bound. We used a socket interface message passing library to provide standard 
bidirectional communications primitives. 



4 Topology- Cognizant Distributed-BCP 

BCP is an integral part of any SAT solver. We distribute BCP on multiple processes 
that are cognizant of topology of the SAT-problem partition running on a network of 
workstations. In [16], during the distributed-SAT solve each Client broadcasts its 
implications to all other processors. After decoding the message, each receiving 
process either reads the message or ignores it. We improve this approach in the 
following way. Each process is made cognizant of the disjoint partitioning. The 
process then sends out implications to only those processes that share the partitioning 
interface variables with it. Each receiving process simply decodes and reads the
message. This helps in two ways: a) the receiving buffer of the process is not filled
with useless information; b) the receiving process does not spend time decoding
useless information. This ensures that the receiving process never has to read a
message that is not meant for it.
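The difference between broadcasting and topology-cognizant sending can be made concrete with a short Python sketch. The data layout is an assumption made for illustration: each process is described by the set of variables that occur in its partition (interface variables occur in two neighboring partitions), and the function name `recipients` is hypothetical.

```python
def recipients(sender, implied_vars, partition_vars):
    """Return {process: variables} for processes that must hear the implications.

    sender         : name of the process that just finished BCP
    implied_vars   : set of variables newly implied by the sender
    partition_vars : dict mapping each process to the variables in its partition
    Only processes sharing interface variables with the sender receive a
    message, and each message carries only the variables relevant to it.
    """
    messages = {}
    for process, variables in partition_vars.items():
        if process == sender:
            continue
        shared = implied_vars & variables
        if shared:
            messages[process] = shared
    return messages


# Linear topology: C1 and C2 share variable 3, C2 and C3 share variable 5.
partition_vars = {"C1": {1, 2, 3}, "C2": {3, 4, 5}, "C3": {5, 6}}
print(recipients("C2", {4, 5}, partition_vars))
```

Here C2 implies variables 4 and 5; only C3 is notified, and only about the interface variable 5, whereas a broadcast scheme would also have sent a message to C1 that C1 would decode and discard.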

We use a distributed model with one Master and several Client processors. The 
Master’s task is to distribute BCP on each Client that owns an exclusive partition of 
the problem. A bi-directional FIFO (First-in First-out) communication channel exists 
only between the process and its known neighbor, i.e., each process is cognizant of its 
neighbors. The process uses the partition topology knowledge for communication so 
as to reduce the traffic of the receiving buffer. A FIFO communication channel 
ensures that the channel is in-order, i.e., the messages sent from one process to 
another will be received in the order sent. Besides distributing BCP, the Master also 
records implications from the Clients as each Client completes its task. 

The main challenge for the Master is to maintain the cause-effect (“happens
before”) ordering of implications in distributed-BCP, since we cannot make assumptions
about channel speeds and relative times of message arrivals during parallel BCP. Maintaining such
ordering is important because it is required for correct diagnosis during the conflict
analysis phase of SAT. In the following, we discuss the problem in detail and
techniques to overcome it.

Consider the Master/Client model as shown in Figure 1. Client C_i can communicate
with C_{i-1} and C_{i+1} besides the Master M. The Master and Clients can generate
implication requests to other Clients; however, Clients can send replies to the Master
only for the requests made to them. Along with the reply message, a Client also sends the
message ids of the requests, if any, it made to the other Clients. This is an
optimization step to reduce the number of redundant messages. To minimize reply
wait time, the Master is allowed to send requests to the Clients even when there are
implications pending from the Client, provided that the global state (maintained by the
Master) is not in conflict.

Let p->q denote an implication request from p to q and p<-q denote implication
replies from q to p. Note that though the channel between C_i and the Master is in-
order, what happens at the event E3 cannot be guaranteed in the following.

E1:M->C1 
E2: C1->C2 
E3:M<-C2 or M<-C1 

If M<-C2 “happens before” M<-C1, then we consider it an out-of-order reply since 
the implications due to M<-C2 depend on C1->C2, which in turn depend on M->C1. 
Moreover, any out-of-order reply from a Client makes subsequent replies from that 
Client out-of-order until the out-of-order reply gets processed. 

We propose a simple solution to handle out-of-order replies to the Master. For each 
Client, the Master maintains a FIFO queue where the out-of-order replies are queued. 
Since the channel between a Client and Master is in-order, this model ensures that 
messages in the FIFO will not be processed until the front of the FIFO is processed. 
We illustrate this with a short event sequence. For simplicity we show the contents for 
FIFO for the Client C2. 



E1: M->C1                          FIFO(C2): -
E2: C1->C2                         FIFO(C2): -
E3: M->C2                          FIFO(C2): -
E4: M<-C2 (in response to E2)      FIFO(C2): E4
E5: M<-C2 (in response to E3)      FIFO(C2): E4, E5
E6: M<-C1 (in response to E1)      FIFO(C2): - (E4 is processed before E5)



Note that in the reply event E6, the Client C1 also notifies the Master of the event E2.
The Master queues the E4 reply as an out-of-order reply as it is not aware of the responsible
event E2 until E6 happens. The E5 reply is also queued as out-of-order as the earlier out-of-
order reply E4 has not been processed yet. When E6 occurs, the Master processes the
messages from the events E6, E4 and E5 (in that order). This maintains the ordering of
the implications in the global assignment stack.
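The queueing discipline can be replayed with a small Python model. The sketch makes several simplifying assumptions that are ours, not the paper's: requests are identified by string ids, a reply is "in order" exactly when the Master already knows the id of the request it answers and no earlier reply from the same Client is still waiting, and the class and method names are hypothetical.

```python
from collections import deque


class Master:
    """Toy model of the Master's per-Client FIFO queues for replies."""

    def __init__(self, clients):
        self.known_requests = set()                  # request ids the Master knows
        self.pending = {c: deque() for c in clients}
        self.processed = []                          # replies in processing order

    def send_request(self, request_id):
        """The Master issues a request itself, so it knows the id."""
        self.known_requests.add(request_id)

    def receive_reply(self, client, reply, request_id, forwarded=()):
        """A reply arrives; `forwarded` lists request ids the Client sent on."""
        self.pending[client].append((reply, request_id, tuple(forwarded)))
        self._drain()

    def _drain(self):
        """Process queue fronts whose originating request is known."""
        progress = True
        while progress:
            progress = False
            for queue in self.pending.values():
                while queue and queue[0][1] in self.known_requests:
                    reply, _, forwarded = queue.popleft()
                    self.processed.append(reply)
                    self.known_requests.update(forwarded)
                    progress = True


m = Master(["C1", "C2"])
m.send_request("r1")                              # E1: M->C1
# E2: C1->C2 with request id "r2" is invisible to the Master for now.
m.send_request("r3")                              # E3: M->C2
m.receive_reply("C2", "E4", "r2")                 # E4 is queued: r2 unknown
m.receive_reply("C2", "E5", "r3")                 # E5 is queued behind E4
m.receive_reply("C1", "E6", "r1", forwarded=["r2"])
print(m.processed)
```

Because each queue is only ever consumed from the front, a reply that arrives after an out-of-order reply from the same Client cannot overtake it, which is the property the event sequence above relies on.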



5 Distributed-SAT 

We use fine grain parallelism in our distributed-SAT algorithm similar to the one
proposed in [16]. However, we use the topology-cognizant distributed-BCP (as
described in the previous section) to carry out distributed-SAT over a network of
workstations. First, we describe the task partitioning between the Master and Clients
as shown in Figure 4.



[Figure panels: Master, SAT-Based BMC, Client]

Fig. 4. Distributed-SAT and SAT-based Distributed-BMC

Tasks of the Master

• Maintains list of constraints, global assignment stack, learnt clauses, antecedents 

• Selects a new decision variable from the best local decision sent by each Client 

• Global conflict analysis using the assignments and antecedents 

• Local BCP on clauses; manages distributed-BCP 

• Receives from Ci: New implications with antecedents and best local decision 

• Sends to Ci: Implications on variables local to Ci, backtrack request, learnt
local clauses, update score request

Tasks of a Client Ci

• Maintains the ordered list of variables, scores, local assignment stack, local learnt 
clauses 

• Keeps the exclusive partition of the problem and topological information 

• Executes on request: Backtrack, decay score, update variable score, local BCP 

• Receives from Master: Implications, backtrack request, update score, clause 

• Receives from neighbor Cj : Implications on interface 

• Sends to Master: New Implications with antecedents and best local decision, best 
local decision when requested, conflict node when local conflict occurs during 
BCP, request id when implication request comes from other Clients 

• Sends to neighbor Cj: New implication requests on interface 



6 SAT-Based Distributed-BMC 

A SAT-based BMC problem originating from an unrolling of the sequential circuit 
over different time frames has a natural linear partition and thereby allows 
configuring the computing resources in a linear topology. The topology using one 
Master and several Clients is shown in Figure 1. Each Client Cj is connected to Cj,^; 
and Cj j. The Master controls the execution of the SAT-based distributed BMC 
algorithm. The BMC algorithm in Figure 3 remains the same except for the following 
changes. The Unroll procedure is now replaced by a distributed unrolling in which 
the procedure Unroll is actually invoked on the Client that hosts the partition for the 
depth i. Note that depending on the memory availability, the host Client is decided 
dynamically. After the unrolling, the distributed-SAT algorithm is invoked (in place 
of SAT_Solve) to check the satisfiability of the problem on the unrolled circuit that 
has been partitioned over several workstations. The tasks are distributed between 
the Master and the Clients as follows. 

Tasks of the Master 

• Allocates an exclusive problem partition to each host Client (box 300 in Figure 4) 

• Requests an unrolling from the terminal Client (box 301 in Figure 4) 

• Controls distributed-SAT as described in Section 5 

Tasks of a Client 

• Handles the current unroll request and also unrolls one step in advance (box 302 in Figure 4) 

• Initiates a new Client, as defined by the topology, when the new unroll size is too large 

• Participates in distributed-SAT 
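
The way unrolling moves along the linear topology can be sketched as follows. This is a minimal illustration under assumptions that are not in the paper: every time frame is given the same hypothetical memory cost (`frame_cost`), every Client has the same budget, and the names `Client` and `distribute_unrolling` are invented for the example. It only shows the hand-over rule, in which a new Client is started when the terminal Client cannot host the next frame.

```python
from dataclasses import dataclass, field


@dataclass
class Client:
    """Hypothetical stand-in for one Client workstation."""
    name: str
    budget: int                      # memory budget, arbitrary units (assumed)
    used: int = 0
    depths: list = field(default_factory=list)


def distribute_unrolling(max_depth, frame_cost, budget):
    """Assign unroll depths 0..max_depth to a chain of Clients.

    The terminal (last) Client hosts each new frame; when it would exceed
    its budget, a new Client is appended to the chain and takes over.
    """
    if frame_cost > budget:
        raise ValueError("a single frame does not fit on any Client")
    clients = [Client("C1", budget)]
    for depth in range(max_depth + 1):
        terminal = clients[-1]
        if terminal.used + frame_cost > terminal.budget:
            terminal = Client(f"C{len(clients) + 1}", budget)
            clients.append(terminal)
        terminal.used += frame_cost
        terminal.depths.append(depth)
    return clients


if __name__ == "__main__":
    for client in distribute_unrolling(max_depth=9, frame_cost=10, budget=40):
        print(client.name, client.depths)
```

In the paper the host Client is chosen dynamically from actual memory availability; the fixed per-frame cost here is only a placeholder for that decision.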

7 Optimizations 

Memory Optimizations in Distributed-SAT 

The bookkeeping information kept by the Master grows with the unroll depth. The 
scalability of our distributed-BMC is determined by the ratio of the memory used by 
the Master to the total memory used by the Clients; the lower this scalability ratio, 
the better. The following steps are taken to lower it: 

• By delegating the task of choosing the local decision and maintaining the ordered 
list of variables to the Client, we save the memory otherwise used by the Master. 

• The Master never keeps the entire circuit information. It relies on the Clients 
to send the reasons for implications that will be used during diagnosis. 

In our experiments, we observed that the scalability ratio for large designs is close to 
0.1, which implies that we can do a 10 times deeper search using a distributed-BMC 
as compared to a non-distributed (monolithic) BMC over a network of similar machines. 
(In our observation, the set of global learnt clauses maintained by the Master does not 
grow exponentially large.) 

Tight Estimation of Communication Overhead 

Inter-workstation communication time can be significant and adversely affects 
performance. We can mitigate this overhead by hiding the execution of certain tasks 
behind the communication latency. To quantify the overhead, we first need a strategy 
to measure the communication time and the actual processing time separately. This is 
non-trivial because the workstation clocks are not synchronized. In the following, we 
discuss a novel strategy to make a tight estimate of the wait time incurred by the 
Master due to inter-workstation communication in Parallel BMC. 

Consider a request-reply communication. Time stamps are local to the Master and 
Client. At time $T_s$, the Master sends its request to the Client. The Client receives the 
message at its time $t_r$. The Client processes the message and sends the reply to the 
Master at time $t_s$. The Master, in the meantime, does some other tasks and then starts 
waiting for the message at time $T_w$. The Master receives the message at time $T_r$. 
Without accounting for the Client processing time, the wait time would be simply 

Wait_Time = $T_r - T_w$ if $T_r > T_w$ (= 0 otherwise) 

This calculated wait time would be an over-estimation of the actual wait time. To 
account for the Client processing time, we propose the following steps: 

• Master sends the request with $T_s$ embedded in the message. 

• Client replies back to the Master with the time stamp $(T_s + (t_s - t_r))$. 

• The Master, depending on the time $T_w$, calculates the actual wait time as follows: 

• Case Tw1: $T_w < (T_s + (t_s - t_r))$: Wait_Time = $T_r - (T_s + (t_s - t_r))$ 

• Case Tw2: $(T_s + (t_s - t_r)) < T_w < T_r$: Wait_Time = $T_r - T_w$ 

• Case Tw3: $T_r < T_w$: Wait_Time = 0 
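
The three cases can be sketched in a few lines of Python. The subscripts are partly illegible in this copy, so the names below are assumptions: `T_s`, `T_w` and `T_r` are read on the Master's clock (send, start of waiting, receive), while `t_r` and `t_s` are read on the Client's clock (receive, send). The function name `estimate_wait_time` is invented for the example.

```python
def estimate_wait_time(T_s, T_w, T_r, t_r, t_s):
    """Master-side wait time with the Client's processing time discounted.

    Only the difference (t_s - t_r) crosses machines, so the two clocks
    never need to be synchronized.
    """
    # Time stamp echoed by the Client: the Master's send time plus the
    # Client's own processing time.
    stamp = T_s + (t_s - t_r)
    if T_r <= T_w:
        # Case Tw3: the reply had already arrived when the Master began waiting.
        return 0.0
    if T_w < stamp:
        # Case Tw1: the Master began waiting while the Client was still
        # processing; only the time after processing counts as overhead.
        return T_r - stamp
    # Case Tw2: the Master began waiting after the Client finished processing.
    return T_r - T_w


if __name__ == "__main__":
    # Hypothetical time stamps: the Client spends 5 units processing.
    print(estimate_wait_time(T_s=0.0, T_w=2.0, T_r=10.0, t_r=100.0, t_s=105.0))
    print(estimate_wait_time(T_s=0.0, T_w=7.0, T_r=10.0, t_r=100.0, t_s=105.0))
    print(estimate_wait_time(T_s=0.0, T_w=12.0, T_r=10.0, t_r=100.0, t_s=105.0))
```

In the first call the naive formula would report 8 units of waiting, whereas the adjusted estimate attributes 3 of those units to Client processing.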



Performance Optimizations in Distributed-SAT 

Now we discuss several performance optimizations in the distributed-SAT algorithm. 

• A large number of communication messages tends to degrade the overall 
performance. We took several measures to reduce the overhead: 

• The Master waits for all Clients to stabilize before sending a new implication 
request. This reduces the number of implication messages sent. 

• Clients send their best local decision along with every implication and backtrack 
reply. At the time of decision, the Master then only selects from the best local 
decisions. It does not need to make an explicit request for a decision variable to 
each Client separately. 

• For all implication requests, Clients send replies only to the Master. This 
reduces the number of redundant messages on the network. 

• Each Client sends its active variables to the Master before doing the initialization. While 
the Master waits for and/or processes the message, the Client does its initialization in 
parallel. 

• When the Master requests each Client to backtrack, it has to wait for the Clients to 
respond with a new decision variable. The following overlapping tasks are done to 
mitigate the wait time: 

• Local backtrack (box 207b in Figure 4) by the Master is done after the remote 
request is sent (box 207b in Figure 4). While the Master waits for the decision 
variable from the Client, the Master also sends the learnt local conflict clauses 
to the respective Client. 

• The function for adjusting the variable score (box 217 in Figure 4) is invoked in the 
Client after it sends the next decision variable (during a backtrack request from 
the Master) (box 216 in Figure 4). Since message-send is non-blocking, 
the function can potentially execute in parallel with the send. On the downside, the 
decision variable that is chosen may be stale. However, note 
that the local decision variable that is sent is very unlikely to be chosen as the decision 
variable, because the next step after a backtrack is an 
implication. Since the Client sends the decision variable after every implication 
request, the staleness of the decision variable will eventually be eliminated. 

Performance Optimization in SAT-Based Distributed-BMC 

• The design is read and initialization is done in all the Clients to begin with. This 
reduces the processing time when the unrolling is initiated onto a new Client. 

• Advance unrolling is done in the Client while the Client is waiting for an implication 
request from the Master. This includes invoking a new partition in a new Client. 



8 Experiments 

We conducted our evaluation of distributed-SAT and SAT-based distributed BMC on 
a network of workstations, each with dual Intel 2.8GHz Xeon processors and 
4GB of physical memory, running Red Hat Linux 7.2 and interconnected by a standard 
10Mbps/100Mbps/1Gbps Ethernet LAN. We compare the performance and scalability 
of our distributed algorithm with a non-distributed (monolithic) approach. We also 
measure the communication overhead using the accurate strategy described in 
Section 7. 

We performed our first set of experiments to measure the performance penalty and 
communication overhead of the distributed algorithms. We employed our SAT-based 
distributed algorithm on 15 large industrial examples, each with a safety property. For 
these designs, the number of flip-flops ranges from ~1K to ~13K and the number of 
2-input gates ranges from ~20K to ~0.5M. Out of the 15 examples, 6 have 
counterexamples and the rest do not have a counterexample within the bound chosen. We used 
a Master (referred to as M) and 2 Clients (referred to as C1 and C2) model where C1 and 
C2 can communicate with each other. We used a controlled environment for the 
experiment under which, at each SAT check in the distributed-BMC, the SAT 
algorithm executes the tasks in a distributed manner as described earlier except at the 
time of decision variable selection and backtracking, when it is forced to follow the 
sequence that is consistent with the sequential SAT. We also used 3 different settings 
of the Ethernet switch to show how the network bandwidth affects the communication 
overheads. We present the results of the controlled experiments in Table 1[a-b]. 

In Table 1a, the 1st Column shows the set of designs (D1-D6 have a 
counterexample), the 2nd Column shows the number of flip-flops and 2-input gates in 
the fan-in cone of the safety property in the corresponding design, the 3rd Column shows 
the bound depth limit for analysis, the 4th Column shows the total memory used by the 
non-distributed BMC, the 5th Column shows the partition depth at which Client C2 took 
exclusive charge of the further unrolling, and Columns 6-8 show the memory 
distribution among the Master and the Clients. In Column 9, we calculate the 
scalability ratio, i.e., the ratio of the memory used by the Master to the total 
memory used by the Clients. We observe that for larger designs, the scalability ratio is 
close to 0.1, though for comparatively smaller designs this ratio was as high as 0.8. 
This can be attributed to the minimum bookkeeping overhead of the Master. Note that 
even though some of the designs have the same number of flip-flops and gates, they have 
different safety properties. The partition depth chosen was used to balance the 
memory utilization; however, the distributed-BMC algorithm chooses the partition 
depth dynamically to reduce the peak requirement on any one Client processor. 



Table 1 [a-b]. Memory & Performance evaluation of the distributed SAT-based BMC 

Table 2. Comparison of monolithic and distributed BMC on Industrial designs 

In Table 1b, the 1st Column shows the cumulative time taken (over all steps) by 
non-distributed BMC, and the 2nd Column shows the cumulative time taken (start to finish 
of Master over all steps) by our distributed-BMC excluding the message wait time. 
Columns 3-5 show the total message wait time for the Master in a 10/100/1000Mbps 
Ethernet switch setting. In Column 6, we calculate the performance penalty by 
taking the ratio of the time taken by distributed to that of non-distributed BMC (= Para 
Time / Mono Time). In Column 7, we calculate the communication overhead for 
the 1Gbps switch setting by taking the ratio of the message wait time to the distributed 
BMC time (= wait time for 1Gbps / Para Time). On average, we find that the 
performance penalty is 50% and the communication overhead is 70%, with an overall 
degradation by a factor of 2.55 (= 1.5 * 1.7). 
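
The arithmetic behind these ratios can be made explicit with a small sketch. It assumes one reading of the text: because Para Time excludes the message wait time, the end-to-end distributed time is Para Time plus wait time, so the overall degradation is (Para + Wait) / Mono = penalty * (1 + overhead). The function names and the sample figures are hypothetical; only the 1.5, 0.7 and 2.55 values come from the text.

```python
def performance_penalty(para_time, mono_time):
    """Column 6: distributed time (without waits) over monolithic time."""
    return para_time / mono_time


def communication_overhead(wait_time, para_time):
    """Column 7: message wait time over distributed time (without waits)."""
    return wait_time / para_time


def scalability_ratio(master_mem, client_mems):
    """Memory used by the Master over the total memory used by the Clients."""
    return master_mem / sum(client_mems)


def overall_degradation(penalty, overhead):
    """(para + wait) / mono, rewritten as penalty * (1 + overhead)."""
    return penalty * (1.0 + overhead)


if __name__ == "__main__":
    # Hypothetical timings chosen to reproduce the reported averages.
    mono, para, wait = 100.0, 150.0, 105.0
    penalty = performance_penalty(para, mono)      # a 50% penalty
    overhead = communication_overhead(wait, para)  # a 70% overhead
    print(penalty, overhead, overall_degradation(penalty, overhead))
    print(scalability_ratio(10.0, [40.0, 60.0]))
```

With a scalability ratio of 0.1, the Master needs about one tenth of the memory that the Clients use together, which is the basis of the "10 times deeper" estimate.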

In some cases (D12-D15), however, we find an improvement in performance over 
non-distributed BMC. This is due to the exploitation of parallelism during the Client 
initialization step as described in Section 7. Note that the message wait time is adversely 
affected by lowering the switch setting from 1Gbps to 10Mbps. This is 
attributed to the fact that Ethernet LAN is inherently a broadcast non-preemptive 
communication channel. 

In our second set of experiments, we used the 5 largest (of 15) designs D11-D15 
that did not have a witness. For distributed-BMC, we configured 5 workstations into 
one Master and 4 Clients C1-C4, each connected to the 1Gbps Ethernet LAN. In 
this setting, Clients are connected in a linear topology and the Master is connected in 
a star with the others. In this experiment, we show the ability of the distributed-BMC to 
do deeper search using distributed memory. For the design D11, we used a partition 
of 81 unroll depths on each Client and for designs D12-D15, we used a partition of 401 
unroll depths on each Client. The results are shown in Table 2. 

In Table 2, the 1st Column shows the set of large designs that were hard to verify, 
the 2nd Column shows the farthest depth to which non-distributed BMC could search 
before it runs out of memory, the 3rd Column shows the time taken to reach the depth 
in the 2nd Column, the 4th Column shows the unroll depth reached by distributed-BMC 
using the allocated partition, and the 5th Column shows the time taken to reach the depth in 
the 4th Column excluding the message wait time. Columns 6-10 show the memory 
distribution for the Master and Clients, and the 11th Column shows the total message wait 
time. In Column 12, we calculate the communication overhead by taking the ratio 
of message wait time to the distributed-BMC time (= MWT / Para Time). In 
Column 13, we calculate the scalability ratio by taking the ratio of memory used by 
the Master to that of the total memory used by the Clients. 

We use the design D11 with ~13K flip-flops and ~0.5 million gates to show the 
performance comparison. For the design D11, we could analyze up to a depth of 323 
with only 30% communication overhead, while using a non-distributed version we 
could analyze only up to 120 time frames under the per-workstation memory limit. 
A low scalability ratio, i.e., 0.1 for large designs, indicates that for these designs our 
distributed-BMC algorithm could have gone 10 times deeper compared to the 
non-distributed version for a similar set of machines. We also observe that the 
communication overhead for these designs was about 45% on average, a small 
penalty to pay for deeper search. 



9 Conclusions 

For verifying designs with high complexity, we need a scalable and robust solution. 
SAT-based BMC is quite popular because of its robustness and better debugging 
capability. Although SAT-based BMC is able to handle increasingly larger designs 
than before as a result of advances in SAT solvers, the memory of a single server 
has become a serious limitation to carrying out deeper search. Existing parallel 
algorithms either focus on improving the SAT performance or are used in 
explicit state-based model checkers or in unbounded implicit state-based model 
checkers. To the best of our knowledge, ours is the first detailed study on providing a 
feasible solution for SAT-based distributed-BMC using an improved distributed SAT 
algorithm. 

Our distributed algorithm uses the normally available large pool of workstations 
that are interconnected by standard Ethernet LAN. For the sake of scalability, our 
distributed algorithm makes sure that no single processor has the entire data. Also, 
each process is cognizant of the partition topology and uses this knowledge to 
communicate with the other processes, thereby keeping unwanted information out of 
each process's receive buffer. We have also proposed several memory and performance 
optimization schemes to achieve scalability and decrease the communication 
overhead. 

In the future, we would like to evaluate our distributed-SAT and SAT-based 
distributed-BMC on a clustered system for high performance computing that has low 
latency and high bandwidth communication [27]. 

Acknowledgements. We thank Guoqiang Pan for implementing the socket-based 
message-passing library. 

Abstract. We consider the problem of bounded model checking of systems 
expressed in a decidable fragment of first-order logic. While model 
checking is not guaranteed to terminate for an arbitrary system, it converges 
for many practical examples, including pipelined processors. We 
give a new formal definition of convergence that generalizes previously 
stated criteria. We also give a sound semi-decision procedure to check 
this criterion based on a translation to quantified separation logic. 
Preliminary results on simple pipeline processor models are presented. 



1 Introduction 

Systems with parameters of finite but arbitrary or large size are often modeled 
as infinite-state systems. Such systems include superscalar processors, communi- 
cation protocols with unbounded channels, and networks of an arbitrary number 
of identical processes. While state elements can still be of Boolean type, richer 
data types such as unbounded integers or unbounded arrays of integers are also 
used. Employing this richer expressive power is one approach to tackling the 
state explosion problem. 

In the area of hardware verification, the logic of Equality with Uninterpreted 
Functions and Memories (EUFM) has been successfully used for the automated 
verification of pipelined processor designs [8,3]. The more general logic of Counter 
Arithmetic with Lambda Expressions and Uninterpreted Functions [4] (CLU) 
has been used for bounded model checking and inductive invariant checking 
of out-of-order microprocessors with unbounded resources [14]. Bounded model 
checking proceeds by symbolically simulating the system for a finite number of 
steps starting from an initial state, checking on each step that a state property 
holds. As the state elements can be terms in a first-order logic, we will refer 
to this technique as term-level bounded model checking. Since term-level models 
can express Turing machines [12], the symbolic simulation might never reach 
a fixpoint in general. However, in many practical cases, the simulation does 
converge. It is therefore necessary to check, after each simulation step, whether 
the simulation has converged. 

Convergence Testing in Term-Level Bounded Model Checking. 
D. Geist and E. Tronci (Eds.): CHARME 2003, LNCS 2860, pp. 348-362, 2003. 
© Springer-Verlag Berlin Heidelberg 2003 

In this paper, we make two main contributions. First, we give a formal defi- 
nition of convergence for term-level bounded model checking, where CLU logic 
is used as the modeling formalism. The convergence criterion is formulated as 
a quantified second-order formula with one quantifier alternation and is unde- 
cidable in general. Second, we give a semi-decision procedure for this class of 
second-order formulas. Our procedure is based on a sound translation to a de- 
cidable fragment of first-order logic called quantified separation logic (QSL). QSL 
formulas are quantified Boolean combinations of Boolean variables and predi- 
cates of the form Xi < Xj + c or Xi = Xj + c, where Xi and Xj are real or 
integer variables, and c is a constant. The QSL formulas are then decided by a 
translation to quantified Boolean logic [15]. Although we use the semi-decision 
procedure for convergence checking, our results are also more generally applica- 
ble to automated theorem proving of second-order formulas. 

Previous term-level model checkers vary in expressiveness of the underlying 
logic, and either use syntactic convergence criteria or approximation techniques 
that guarantee convergence at the cost of completeness. Hojati et al. [12] pre- 
sented a modeling formalism called ICS which is similar in expressiveness to 
EUFM. They showed that ICS models do not converge in general, except under 
highly restrictive assumptions that are not of practical interest. Isles et al. [13] 
built on this work, giving a conservative, syntactic definition of convergence of 
ICS models, and using it to verify versions of the DLX pipeline. Our logic is more 
expressive than ICS. Also, as we show in Section 5.2, their convergence criterion 
is a special case of the one we present in this paper. Corella et al. [9] have used 
Multiway Decision Graphs (MDGs) for term-level model checking. MDGs are 
BDD-like data structures used for representing formulas in quantifier-free logics 
such as EUFM and CLU; the exact logic represented depends on the set of in- 
terpreted function symbols used in the model. Thus, Corella et al. use MDGs to 
represent the characteristic function of the set of states of a term-level model. 
Unlike our work, their models cannot have variables of function type, and hence 
cannot verify systems with embedded memories. However, they address a more 
general class of properties expressible in a first order temporal logic. With respect 
to convergence checking, Corella et al. use syntactic rewriting techniques similar 
to those employed for ICS [13]. Bultan et al. [6] have used Presburger arithmetic 
for verifying concurrent algorithms. Checking convergence for systems expressed 
in Presburger arithmetic is decidable; however, since the model checking might 
not converge in general, they conservatively approximate the fixpoint, allowing 
the possibility of spurious counterexamples. In comparison, our use of CLU logic 
allows us to use uninterpreted functions and also lets us model richer systems 
with memories. This expressive power, however, results in convergence checking 
becoming undecidable. 

The rest of the paper is organized as follows. Section 2 presents CLU logic 
and our system modeling formalism. Section 3 defines the term-level bounded 
model checking problem. In Section 4, we formally define the convergence 
criterion. Section 5 describes how we check this criterion. Finally, we conclude in 
Section 6 with some preliminary results with pipelined processor models. For 
brevity, we have omitted proofs of theorems and an alternate complete 
semi-decision procedure; these can be found in an accompanying technical report [5]. 

2 Preliminaries 

2.1 CLU Logic 

Syntax. The syntax includes four classes of expressions, representing computa- 
tions of truth values or integers, as well as functions over integers yielding truth 
values or integers. We use symbols to represent abstract values and functions. 

bool-expr      ::= true | false | bool-symbol | ¬bool-expr | (bool-expr ∧ bool-expr) 
                 | (int-expr = int-expr) | (int-expr < int-expr) 
                 | predicate-expr(int-expr, ..., int-expr) 
int-expr       ::= lambda-var | int-symbol | ITE(bool-expr, int-expr, int-expr) 
                 | int-expr + int-constant | function-expr(int-expr, ..., int-expr) 
predicate-expr ::= predicate-symbol | λ lambda-var, ..., lambda-var . bool-expr 
function-expr  ::= function-symbol | λ lambda-var, ..., lambda-var . int-expr 

Fig. 1. Expression Syntax. Expressions can denote computations of Boolean values, 
integers, or functions yielding Boolean values or integers. 

Symbols are written with a typewriter font, such as a or f. Associated with 
each symbol is a type indicating what kind of value it represents (truth, integer, 
function, or predicate). For function and predicate symbols, the type includes 
its arity, indicating the number of arguments it takes. For function symbol f, we 
write its arity as arity(f). For a set of symbols $A$, we let $E(A)$ denote the set of 
all expressions that can be formed using these symbols, obeying the usual rules 
on type matching. 

The syntax includes integer lambda variables. These only serve to represent 
the arguments to lambda expressions. Note also that the lambda expression 
syntax is constrained so that they cannot have functions as arguments, and they 
cannot express any form of looping or recursion. 

Sets of Expressions. We use two ways to refer to sets of expressions in 
which we must identify the different elements. The first is a vector notation, in 
which we index the elements with integer subscripts. We use the notation $\vec{e}$ to 
denote a vector with elements $e_1, \ldots, e_n$. The second is a named-element 
notation, in which we have a set of symbolic names $A$ and write a set of expressions 
$e$ as having an element $e_a$ for each $a \in A$. 

With both notations, we can indicate the syntactic substitution of elements 
for symbols or variables in an expression. That is, the expression $s[\vec{e}/\vec{x}]$ 
denotes the expression where each instance of $x_i$ in $s$ is replaced by the expression 
$e_i$ for $1 \le i \le n$. These substitutions are performed in parallel, so there is no 
ambiguity if some expression $e_i$ contains the symbol $x_j$. Similarly, $s[e/A]$ indicates 
the result of replacing each instance of a symbol $a \in A$ with the expression $e_a$. 

Semantics. For a set of symbols $A$, we let $\sigma_A$ indicate an interpretation 
of each of these symbols. That is, $\sigma_A$ maps each symbol to an integer, a truth 
value, or a function according to the symbol type. For any expression $e \in E(A)$, 
we define its evaluation under interpretation $\sigma_A$, denoted $\langle e \rangle_{\sigma_A}$, as the value 
obtained by evaluating $e$ when each symbol $a$ is replaced by its interpretation 
$\sigma_A(a)$. We omit the detailed definition. 

A truth expression $e \in E(A)$ is said to be universally valid when it evaluates 
to true for all interpretations of its symbols, i.e., when $\langle e \rangle_{\sigma_A} = \text{true}$ for all $\sigma_A$. 

As a final notation, for disjoint symbol sets $A$ and $B$, having interpretations 
$\sigma_A$ and $\sigma_B$ respectively, we let $\sigma_A \cdot \sigma_B$ denote the interpretation over the symbols in 
$A \cup B$ obtained by applying the respective interpretations to the symbols in $A$ 
and $B$. 

As noted earlier, our syntax for function applications requires all arguments 
to be integer expressions. We can therefore transform any integer or truth 
expression containing lambda expressions into an equivalent lambda-free one by 
performing Beta reduction, in which the actual parameter expressions are 
syntactically substituted in parallel for the formal parameters (the lambda variables). 
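
Parallel substitution and Beta reduction can be illustrated with a small Python sketch. The term encoding is an assumption made for this example only: a symbol or lambda variable is a string, an integer constant is an `int`, and every other expression is a tuple `(op, arg, ...)`, with `("lam", params, body)` for lambda expressions and `("app", fn, arg, ...)` for applications. The sketch also assumes that lambda variable names are distinct from symbol names, so no renaming is needed to avoid capture.

```python
def substitute(expr, mapping):
    """Replace every name in `mapping` simultaneously (parallel substitution)."""
    if isinstance(expr, str):
        return mapping.get(expr, expr)
    if not isinstance(expr, tuple):
        return expr                              # integer constant
    if expr[0] == "lam":
        _, params, body = expr
        # Names bound by this lambda are not free in the body.
        inner = {k: v for k, v in mapping.items() if k not in params}
        return ("lam", params, substitute(body, inner))
    return (expr[0],) + tuple(substitute(arg, mapping) for arg in expr[1:])


def beta_reduce(expr):
    """Eliminate applications of lambda expressions, innermost first."""
    if not isinstance(expr, tuple):
        return expr
    if expr[0] == "lam":
        return ("lam", expr[1], beta_reduce(expr[2]))
    reduced = tuple(beta_reduce(arg) for arg in expr[1:])
    if expr[0] == "app" and isinstance(reduced[0], tuple) and reduced[0][0] == "lam":
        _, params, body = reduced[0]
        # Actual parameters replace the formal parameters in parallel.
        return beta_reduce(substitute(body, dict(zip(params, reduced[1:]))))
    return (expr[0],) + reduced


if __name__ == "__main__":
    # Swapping x and y only works because the substitution is parallel.
    print(substitute(("lt", "x", "y"), {"x": "y", "y": "x"}))
    # (lambda v . v + 1)(c)  reduces to  c + 1
    print(beta_reduce(("app", ("lam", ("v",), ("add", "v", 1)), "c")))
```

Termination relies on the restriction stated in the text: lambda expressions cannot take functions as arguments and cannot express recursion.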

2.2 System Model 

We model the system as having a number of state elements, where each state 
element may be a truth or integer value, or a function or predicate. This latter 
class of state elements allows us to describe various forms of memories. For 
example, a conventional random-access memory can be modeled as a function 
that yields an integer data value given an integer address as argument. We use 
symbolic names to represent the different state elements, giving the set of state 
symbols $\mathcal{S}$. We also introduce a set of input symbols $\mathcal{T}$, representing a set of input 
signals that can be set to different values on each step of operation. That is, on 
each step $i$, we introduce a symbol $a_i$ for each input symbol $a$. We refer to such 
signals as the indexed input symbols. We introduce two more sets of symbols $\mathcal{K}$ 
and $\mathcal{I}$ to allow one run by the verifier to compute the behavior of systems with 
different functionality operating with different initial state and input values. The 
symbols in $\mathcal{K}$ parameterize system functionality. This could include, for example, 
function symbols for the ALU, and the contents of the instruction memory. The 
symbols in $\mathcal{I}$ parameterize the initial state and system input sequence. These 
could include a function symbol to encode the initial state of a memory. They 
also include the indexed input symbols. 

The overall system operation is characterized by an initial state and a 
transition behavior $\delta$. The initial state contains an expression for each state 
element. The initial value of state element $a$ is given by an expression in $E(\mathcal{I})$. 
The transition behavior consists of an expression for each state element. The 
behavior for state element $a$ is given by an expression $\delta_a \in E(\mathcal{K} \cup \mathcal{S} \cup \mathcal{T})$. In this 
expression, we use the state element symbols to represent the current system 
state, and the input symbols to represent the current values of the inputs. The 
expression then gives the new state for that state element. 

From these expressions, we define the state sequence for the system 
$s^0, \ldots, s^i, \ldots$, where the state at step $i$ consists of an expression 
$s^i_a \in E(\mathcal{K} \cup \mathcal{I})$ for each state element $a$. This expression is given by performing 
the double substitution 

$$ s^i_a = \delta_a\left[ s^{i-1}/\mathcal{S},\ t^{i}/\mathcal{T} \right], \tag{1} $$

where the input expression $t^i$ has $t^i_a = a_i$ for each $a \in \mathcal{T}$. As mentioned earlier, we 
always perform Beta reduction following a substitution such as this. We also use a 
shorthand for this process of generating the expressions for the state at step $i$. 

3 Property Checking 

A system property $P$ is represented as a Boolean expression over the state 
elements, $P \in E(\mathcal{S})$. Typically we want to determine whether $P$ holds at some 
particular step $k$, or whether $P$ holds at every step. We can determine whether 
$P$ holds at some particular step $k$ by applying a decision procedure for CLU 
logic. However, our interest here is to prove that $P$ holds for every step $i \ge 0$. 
In general, this task is undecidable. The problem remains undecidable even if we 
restrict the class of systems to ones with only integer state elements, and where 
the system behavior is described using a logic of equality with uninterpreted 
functions [12]. 

Instead, we focus on a more restricted class of systems that satisfy a property 
we call k-convergence. With these systems, every reachable state can be reached 
within $k$ steps for some combination of initial state and inputs, for some fixed 
bound $k$. If we can prove that a system is $k$-convergent, then we can guarantee 
property $P$ holds on every step by verifying that it holds on every step up 
through $s^k$. 

Formally, we say that a system with an initial state and transition behavior 
$\delta$ converges in $k$ steps when, for every interpretation $\sigma_{\mathcal{I}}$ of the initial state and 
inputs and for every interpretation $\sigma_{\mathcal{K}}$ of the system parameters, there exists a 
step $i \le k$ and an alternate interpretation $\hat{\sigma}_{\mathcal{I}}$ of the initial state and inputs, such 
that for every state symbol $a \in \mathcal{S}$ 

$$ \langle s^{k+1}_a \rangle_{\sigma_{\mathcal{I}} \cdot \sigma_{\mathcal{K}}} = \langle s^{i}_a \rangle_{\hat{\sigma}_{\mathcal{I}} \cdot \sigma_{\mathcal{K}}}. \tag{2} $$

We use the shorthand $\langle s^{k+1} \rangle_{\sigma_{\mathcal{I}} \cdot \sigma_{\mathcal{K}}} = \langle s^{i} \rangle_{\hat{\sigma}_{\mathcal{I}} \cdot \sigma_{\mathcal{K}}}$ to indicate this equality for every 
state element. Property (2) states that by step $k+1$, the system will not reach any 
new states. That is, for every possible interpretation of the system parameters 
$\sigma_{\mathcal{K}}$, and for every possible operation of the system for $k+1$ steps, as determined 
by the interpretation $\sigma_{\mathcal{I}}$ of the initial state and indexed input symbols $\mathcal{I}$, there is 
some alternate initial state and input sequence, given by interpretation $\hat{\sigma}_{\mathcal{I}}$, that 
would have led to the exact state in $i$ steps for some $0 \le i \le k$. 

We show that this property guarantees that the system will not reach new 
states beyond step $k$. 

Theorem 1. If a system converges in $k$ steps, then for any $j \ge 0$ and any 
interpretation $\sigma_{\mathcal{K}}$ of the system parameters, there exists a step $i \le k$ and an 
alternate interpretation $\hat{\sigma}_{\mathcal{I}}$ of the initial state and inputs, such that 

$$ \langle s^{j} \rangle_{\sigma_{\mathcal{I}} \cdot \sigma_{\mathcal{K}}} = \langle s^{i} \rangle_{\hat{\sigma}_{\mathcal{I}} \cdot \sigma_{\mathcal{K}}}. \tag{3} $$
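
For intuition, the definition has a simple explicit-state analogue, sketched below in Python. This is not the term-level check developed in the paper: it assumes a finite set of concrete states and inputs and a single fixed transition function (one interpretation of the system parameters). The names `states_at_step` and `converges_in`, and the two-stage shift register used as an example, are invented for the illustration.

```python
def states_at_step(step, init_states, inputs, delta):
    """All concrete states reachable in exactly `step` transitions."""
    frontier = set(init_states)
    for _ in range(step):
        frontier = {delta(state, value) for state in frontier for value in inputs}
    return frontier


def converges_in(k, init_states, inputs, delta):
    """Finite-state analogue of property (2).

    Every state reachable at step k+1 must already be reachable at some
    step i with 0 <= i <= k, possibly from a different initial state and
    input sequence.
    """
    seen = set()
    for i in range(k + 1):
        seen |= states_at_step(i, init_states, inputs, delta)
    return states_at_step(k + 1, init_states, inputs, delta) <= seen


def shift(state, value):
    """Two-stage shift register: the input enters stage 0, stage 0 moves on."""
    return (value, state[0])


if __name__ == "__main__":
    for k in range(3):
        print(k, converges_in(k, {(0, 0)}, (0, 1), shift))
```

For the shift register the check fails for k = 0 and k = 1 and succeeds for k = 2, after which no new states can appear, in line with Theorem 1.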



4 Formulation of the Convergence Criterion 



We now reach the main topic of this paper: determining whether a system is 
$k$-convergent for some value of $k$. We can express this as a problem in 
second-order logic as follows. Introduce a symbol set $\hat{\mathcal{I}}$ consisting of a symbol $\hat{a}$ for 
each initial state symbol $a \in \mathcal{I}$, and a symbol $\hat{a}_i$ for each indexed input 
signal $a_i$, for $1 \le i \le k$. Rewrite each state expression $s^i$, for $0 \le i \le k$, to an 
expression $r^i$ by replacing each symbol in $\mathcal{I}$ with its counterpart in $\hat{\mathcal{I}}$. 

Using the notation of predicate calculus, we consider the symbols in $\mathcal{I}$, $\hat{\mathcal{I}}$, 
and $\mathcal{K}$ to be quantified variables, either first-order (for integer or Boolean 
symbols) or second-order (for function or predicate symbols). We can then write the 
convergence criterion as: 

$$ \forall \mathcal{K}\ \forall \mathcal{I}\ \exists \hat{\mathcal{I}}\ \bigvee_{0 \le i \le k} r^{i} = s^{k+1} \tag{4} $$

With these quantifiers, we are really quantifying over the possible interpretations 
of the symbols. Note that this formula cannot be expressed in first-order logic, 
because we have existentially quantified function symbols. 



Example 1. Consider a system with the integer state variables x, y and Boolean 
state variable b. The operations are defined by: 

init[x] = $c_0$      init[y] = $c_0$      init[b] = true 
next[x] = f(x)     next[y] = f(y)     next[b] = (x = y) 

where $c_0$ is an integer symbol and f is an uninterpreted function symbol. Using 
our notation, the sets of symbols are defined as follows: $\mathcal{S} = \{x, y, b\}$, $\mathcal{K} = \{f\}$, 
$\mathcal{I} = \{c_0\}$ and $\hat{\mathcal{I}} = \{\hat{c}_0\}$. 

After simulating the system for one step, the convergence condition (given 
by equation 4, where $k = 0$) becomes: 

$$ \forall f\ \forall c_0\ \exists \hat{c}_0\ \left[ \hat{c}_0 = f(c_0) \wedge \hat{c}_0 = f(c_0) \wedge \text{true} = (f(c_0) = f(c_0)) \right] $$

which simplifies to $\forall f\ \forall c_0\ \exists \hat{c}_0\ [\hat{c}_0 = f(c_0)]$, which is clearly valid, with $\hat{c}_0$ taking 
the value $f(c_0)$. 

Therefore the system converges after one step of simulation. As expected, 
the state variable b is always true in the reachable set of states. 
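
As a sanity check of Example 1, the quantified condition can be tested by brute force over a small finite domain. This is only a sketch under stated assumptions: the real quantifiers range over all integers and all functions, so a finite check is an illustration and not a proof, and the function name `example1_converges` is invented here.

```python
from itertools import product


def example1_converges(domain):
    """Check: for all f and c0 there is an alternate c0 whose initial state
    equals the state reached after one step, with values drawn from `domain`."""
    domain = list(domain)
    for images in product(domain, repeat=len(domain)):   # every f: domain -> domain
        f = dict(zip(domain, images))
        for c0 in domain:
            x, y, b = c0, c0, True                        # init[x], init[y], init[b]
            # One simultaneous step: next[b] reads the old values of x and y.
            x, y, b = f[x], f[y], (x == y)
            # Search for the alternate initial value (the existential quantifier).
            if not any((alt, alt, True) == (x, y, b) for alt in domain):
                return False
    return True


if __name__ == "__main__":
    print(example1_converges(range(3)))
```

The search always succeeds with the alternate value f(c0), matching the witness given in the example.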

For a function or predicate state element $F$, the expression $r_F = s_F$ is a 
second-order equation: it states that two functions or predicates are identical 
for all possible arguments. 

For systems without function or predicate state elements, our convergence 
criterion yields a formula with the quantification structure shown in (4), with 
only first-order equations. Even for the simple case of a system with one integer 
symbol in $\mathcal{I}$ and one function symbol of arity 2 in $\mathcal{K}$, the truth of a formula 
with this structure is undecidable [2]. 

Again we find ourselves facing an undecidable property. We deal with this 
by 1) using syntactic transformations to eliminate the second-order equations 
for function and predicate state elements, and 2) using a sound, but incomplete 
decision procedure for second-order formulas of the form shown in (4). Our 
procedure is quite simple, but it seems to work well for the formulas arising in 
our convergence testing. 



5 Checking Convergence 

5.1 Function and Predicate State Elements 

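We can convert our convergence formula (4) to one containing only first-order
equations by introducing a set of argument symbols $\mathcal{Z} = \{z_1, \ldots, z_n\}$, where n
is the maximum arity of any predicate or function state element. Suppose state
element $F$ has arity $\mathit{arity}(F) = m$. Then define $\hat{r}^i_F = r^i_F(z_1, \ldots, z_m)$, and
similarly define $\hat{s}^{k+1}_F = s^{k+1}_F(z_1, \ldots, z_m)$. Then we can rewrite the convergence
criterion as:

\[
\forall \mathcal{K}\; \forall \mathcal{I}\; \exists \mathcal{J}\; \forall \mathcal{Z}
\bigvee_{0 \le i \le k} \; \bigwedge_{a \in \mathcal{S}} \hat{r}^i_a = \hat{s}^{k+1}_a
\qquad (5)
\]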



Unfortunately, we have no general approach to handle formulas with this 
quantifier structure. Instead, we use rewriting techniques to handle limited forms 
of function and predicate state elements. Our technique is sufficient to handle 
random-access memories, including the data memory and register file of a mi- 
croprocessor. 

A random-access memory is modeled as a function state element Mem where
the argument is an address, and the function returns the value stored at that
address. Consider a memory with address input Adr, data input Dat and
write-enable signal Wrt. We describe the memory operation in our term-level
modeling language as:

    init[Mem] = m_0
    next[Mem] = λx . ITE(Wrt ∧ x = Adr, Dat, Mem(x))

where $m_0$ is an uninterpreted function giving the initial memory contents. Note
the restricted class of expressions that will result when modeling the operation
of this memory over time to generate the expression $s^i_{\mathrm{Mem}}$: the base is an
uninterpreted function, which can be assigned an interpretation that matches
any desired functionality. There will then be a bounded number of updates due
to write operations, but these will each be to a single (symbolic) address.
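
The shape of these memory expressions can be sketched with ordinary closures. This is an illustrative model only: the helper names `ite`, `init_mem` and `next_mem`, and the use of a tagged tuple to stand in for the uninterpreted function $m_0$, are assumptions made here and are not part of the modeling language described above.

```python
def ite(cond, then_val, else_val):
    """If-then-else on already evaluated arguments."""
    return then_val if cond else else_val

def init_mem(m0):
    """init[Mem] = m0"""
    return m0

def next_mem(mem, wrt, adr, dat):
    """next[Mem] = lambda x . ITE(Wrt and x = Adr, Dat, Mem(x))"""
    return lambda x: ite(wrt and x == adr, dat, mem(x))

# A stand-in for the uninterpreted initial contents: the value at address x
# is just the tagged pair ("m0", x), so reads of unwritten cells stay symbolic.
m0 = lambda x: ("m0", x)

mem = init_mem(m0)
mem = next_mem(mem, True, 5, "d1")    # write d1 to address 5
mem = next_mem(mem, False, 7, "d2")   # write-enable off: no change
mem = next_mem(mem, True, 9, "d3")    # write d3 to address 9

print(mem(5), mem(7), mem(9))
```

After three steps the memory is the base function wrapped in a bounded number of single-address updates, which is the restricted structure the rewriting rules rely on.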

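Suppose we wish to determine whether the system has converged for some
fixed time point i, so that Equation 5 reduces to

\[
\forall \mathcal{K}\; \forall \mathcal{I}\; \exists \mathcal{J}\; \forall \mathcal{Z}
\bigwedge_{a \in \mathcal{S}} \hat{r}^i_a = \hat{s}^{k+1}_a
\qquad (6)
\]

Then the convergence criterion for state element Mem will have the general form:

\[
\forall A\; \exists B\; \forall z\; F'(z) = F(z) \qquad (7)
\]

where expression $F$ has only symbols in $A$, while expression $F'$ has symbols from
both $B$ and $A$.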

We apply a set of rewrites to the symbols in $B$ and generate a set of verification
conditions that guarantees (7) holds, based on the structure of expression
$F'$. In general, our rules apply to equations of the form $P(z) \Rightarrow F'(z) = F(z)$,
where $P$ is a predicate expression with symbols from both $B$ and $A$. At the top
level, we start with $P$ being an expression that always yields true.

1. For equations of the form $P(z) \Rightarrow f'(z) = F(z)$, where $f'$ is a
   function symbol in $B$, rewrite all occurrences of $f'$ in $F$ to be
   $\lambda x\, .\, \mathit{ITE}(P(x), F(x), f'(x))$.

2. For equations of the form $P(z) \land z = E \Rightarrow F'(z) = F(z)$, where $E$
   is an expression with symbols from both $B$ and $A$, reduce the equation
   to $P(E) \Rightarrow F'(E) = F(E)$. This eliminates any reference to $z$ in the
   equation.

3. For equations of the form $P(z) \Rightarrow [\lambda x\, .\, \mathit{ITE}(Q(x), G'(x), H'(x))](z) = F(z)$,
   where $Q$, $G'$, and $H'$ are predicate and function expressions containing
   symbols in both $A$ and $B$, we generate two verification conditions:
   $P(z) \land Q(z) \Rightarrow G'(z) = F(z)$, and $P(z) \land \lnot Q(z) \Rightarrow H'(z) = F(z)$, and solve
   these recursively.

4. For equations of the form $P(z) \Rightarrow f(z) = F(z)$, where $f$ is a function
   symbol in $A$, we recursively analyze the structure of $F$.

   – If $F$ is of the form $\mathit{ITE}(Q(x), G(x), H(x))$, where $Q$, $G$, and $H$ are
     predicate and function expressions containing symbols in $A$, we generate
     two verification conditions: $P(z) \land Q(z) \Rightarrow f(z) = G(z)$, and
     $P(z) \land \lnot Q(z) \Rightarrow f(z) = H(z)$, and solve these recursively.

   – If $F$ is of the form $g(z)$, then the symbols $f$ and $g$ need to be the same.
     If the two symbols are different, we return false, which implies that no
     rewrite exists.

5. For equations of the form $P(z) \Rightarrow F'(z + c) = F(z)$ with integer constant
   $c$, transform the equation to be $P(z - c) \Rightarrow F'(z) = F(z - c)$, and solve
   it recursively.

Similar rules hold for equations of the form $P \Rightarrow F'(z) = F(z)$, i.e., $P$ is a
Boolean expression independent of $z$.

Given the special form of the expressions describing the updating of a 
random-access memory, we can see that by repeated application of these rules, 
we can eliminate all occurrences of symbol z in (6). The first rule handles the 
uninterpreted function representing the initial memory state. The second rule 
handles updates to individual memory addresses. The third rule lets us split 
based on the case structure of the expression. The last two rules would be re- 
quired for more complex memory structures. 

Note that CLU logic can be used to model memories in which multiple entries 
can be updated in parallel [14]. The rewriting techniques proposed in this section 
do not work for such memories. 

5.2 Convergence with First-Order Equations 

Assume we have applied transformation rules to eliminate all second-order
equations, and hence the convergence criterion is expressed by an equation of the
form shown in (4) with only first-order equations. We would therefore like to decide
the validity of a formula $\psi$ of the form

\[
\psi = \forall A\; \exists B\; \varphi \qquad (8)
\]

where $\varphi$ does not contain any quantifiers. In fact, $\varphi$ is a CLU formula, and we can
assume that transformations have been applied to eliminate all ITE operations*
and lambda applications.

Our system model is sufficiently general that we can generate any second-order
formula having the structure shown in (8) as part of a convergence test.
To see this, let the variables in $\varphi$ be $A = \{a_1, \ldots, a_n\}$ and $B = \{b_1, \ldots, b_m\}$.
Introduce a set of $m + 1$ state elements, consisting of an element $q_i$ for each
existentially quantified variable $b_i \in B$, and a final truth-valued state element
$q_{m+1}$. For each universally quantified variable $a_i \in A$, introduce a system
parameter $a_i$. Let the system
have transition behavior 5 such that i5q„+i = (p [qA/bm, ^/on], and i5q^ = for 
1 < i < m. Finally, let the initial state s°. of each state element q^ for 1 < f < m 
be tti, and the initial state of q^+i be true. Then the system is 0-convergent if 
and only if the formula $\forall A\, \exists B\, \varphi$ is valid.

This construction shows that we cannot assume any particular restrictions 
on the formulas we must decide to prove convergence, other than the quantifier 
structure shown in (8). 

Syntactic Approach. Previous approaches to convergence have been based on
finding syntactic similarities between the earlier state $r^i$ and the current state
$s^{k+1}$. The convergence criterion given by Isles et al. [13] is a more conservative
check than the criterion we give in Equation 5, and hence is less general. We
can see that their syntactic substitution-based technique is simply a strategy for
proving the validity of a formula with the structure shown in (8) as follows.

* These can be eliminated by the “push to the leaves” transformation [16].
Proposition 1. Let $\mathbf{b}$ denote a set containing an expression $b_a$ over the symbols
in $A$ for each $a \in B$. If $\forall A\, \varphi[\mathbf{b}/B]$ is valid, then so is $\forall A\, \exists B\, \varphi$.

The proof of this proposition follows by instantiating any symbol $a \in B$ with the
value of $b_a$.

With this approach, we can prove convergence by using a decision procedure
for CLU logic to prove the universal validity of $\varphi[\mathbf{b}/B]$. The challenge, of course,
is to find an appropriate set of substitutions to the symbols in $B$.



Semantic Approach. We describe a way to transform formulas of the structure
$\psi = \forall A\, \exists B\, \varphi$ into a formula in the logic we call Quantified Separation
Logic (QSL). QSL consists of quantified Boolean and integer variables, Boolean
connectives, and predicates of the form $x = y + c$ and $x < y + c$, where $x$ and
$y$ are integer variables, and $c$ is an integer constant. Our translation $T_s(\psi)$ (for
“sound”) yields a formula that is valid only if $\psi$ is valid. By deciding the validity
of the translation we can test for definite convergence.

We can rewrite any Boolean or integer expression in CLU into a normal form,
in which all ITE operations have been eliminated, and the additions of integer
constants are grouped together. Define an atomic expression as either an integer
or Boolean symbol, or an application of a function or predicate symbol.

Without loss of generality, let us assume $\varphi$ is in normal form. We start by
enumerating all of the atomic expressions occurring in $\varphi$ as a sequence $g_1, \ldots, g_n$.
Let $\mathit{top}(g_i)$ denote the top-level symbol in subexpression $g_i$. We can see that each
atomic expression $g_i$ must be of one of the following forms:

1. Boolean symbol, $g_i = b$, giving $\mathit{top}(g_i) = b$.

2. Predicate application, $g_i = p(g_{i_1} + c_{i,1}, \ldots, g_{i_k} + c_{i,k})$, giving $\mathit{top}(g_i) = p$.

3. Integer symbol, $g_i = x$, giving $\mathit{top}(g_i) = x$.

4. Function application, $g_i = f(g_{i_1} + c_{i,1}, \ldots, g_{i_k} + c_{i,k})$, giving $\mathit{top}(g_i) = f$.

We require the sequence to be ordered according to subexpression containment.
That is, for the function and predicate application forms listed above, we require
$i_l < i$ for $1 \le l \le k$. The soundness property of translation $T_s$ holds for any such
ordering, but we get a tighter bound by listing the subexpressions having top-level
symbols in $A$ as early as possible. That is, if $\mathit{top}(g_i) \in A$ and $\mathit{top}(g_j) \in B$,
then $i < j$, unless $g_j$ is a subexpression of $g_i$.

Now introduce a sequence of symbols $\mathcal{V} = v_1, \ldots, v_n$, where $v_i$ is an integer
(respectively, Boolean) symbol when $\mathit{top}(g_i)$ is an integer or function symbol
(respectively, Boolean or predicate symbol). We generate two formulas $C_A$ and
$C_B$, each of which is a conjunction of consistency constraints by considering each
pair of subexpressions $g_i$ and $g_j$, with $i < j$ and $\mathit{top}(g_i) = \mathit{top}(g_j)$. These are the
same constraints used by Ackermann for removing function applications from a
formula [1]. For subexpression $g_i$ of the form $f(g_{i_1} + c_{i,1}, \ldots, g_{i_k} + c_{i,k})$, and $g_j$
of the form $f(g_{j_1} + c_{j,1}, \ldots, g_{j_k} + c_{j,k})$, we include the constraint

\[
v_{i_1} = v_{j_1} + (c_{j,1} - c_{i,1}) \,\land\, \cdots \,\land\, v_{i_k} = v_{j_k} + (c_{j,k} - c_{i,k})
\;\Rightarrow\; v_i = v_j
\qquad (9)
\]

This constraint is included in either $C_A$ or $C_B$ according to whether $f \in A$ or
$f \in B$. Similar constraints are generated when the top-level symbol in $g_i$ and $g_j$
is a predicate symbol $p$.
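
The generation of the consistency constraints in (9) can be sketched as follows. The data layout is an assumption made for this illustration: each atomic expression is a pair `(top_symbol, args)`, where `args` lists `(index, offset)` pairs pointing at earlier atomic expressions (0-based), and a premise `(a, b, d)` stands for $v_a = v_b + d$. The function name `ackermann_constraints` is hypothetical.

```python
from itertools import combinations

def ackermann_constraints(atoms, universal_symbols):
    """Return (C_A, C_B) as lists of (premises, (i, j)) pairs.

    Each entry encodes: (conjunction of premises) => v_i = v_j, where a
    premise (a, b, d) means v_a = v_b + d, following constraint (9).
    """
    c_a, c_b = [], []
    for i, j in combinations(range(len(atoms)), 2):
        (top_i, args_i), (top_j, args_j) = atoms[i], atoms[j]
        if top_i != top_j or not args_i:
            continue  # only pairs of applications of the same symbol
        premises = [(ai, aj, cj - ci)
                    for (ai, ci), (aj, cj) in zip(args_i, args_j)]
        target = c_a if top_i in universal_symbols else c_b
        target.append((premises, (i, j)))
    return c_a, c_b

# Atomic expressions y, f(y), f(y+2), x, f(x), f(x+1), with f and y universal.
atoms = [
    ("y", []),
    ("f", [(0, 0)]),   # f(y)
    ("f", [(0, 2)]),   # f(y + 2)
    ("x", []),
    ("f", [(3, 0)]),   # f(x)
    ("f", [(3, 1)]),   # f(x + 1)
]
c_a, c_b = ackermann_constraints(atoms, universal_symbols={"f", "y"})
for premises, (i, j) in c_a:
    print(premises, "=>", f"v{i} = v{j}")
```

This sketch emits every pair, including pairs whose premise can never hold (such as $y = y + 2$); a practical implementation would likely drop those.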

Let $\hat{\varphi}$ be the formula generated by replacing each atomic expression $g_i$ in
$\varphi$ with the symbol $v_i$. We always replace maximal subexpressions, so that the
resulting formula no longer contains any symbols from $\varphi$.

Let quantifier $Q_i$ be $\forall$ when $\mathit{top}(g_i) \in A$, and $\exists$ when $\mathit{top}(g_i) \in B$.

The soundness-preserving translation of $\psi$ is given by

\[
T_s(\psi) = Q_1 v_1\; Q_2 v_2 \cdots Q_n v_n\; [\, C_A \Rightarrow (C_B \land \hat{\varphi}) \,]
\qquad (10)
\]

Theorem 2. For any formula $\psi$ having the structure $\psi = \forall A\, \exists B\, \varphi$, if $T_s(\psi)$,
as given by (10), is valid, then so is $\psi$.

We also provide a completeness-preserving translation in [5]. We can test for
possible convergence by deciding the validity of this translation.

We now give some examples to demonstrate the capabilities and limitations 
of our translation method. 

Example 2. Our first example is a case where we successfully prove soundness.

\[
\forall f, y\; [\forall x\; x = f(x)] \Rightarrow y = f(f(y)) \qquad (11)
\]

To get this into the required form, we rewrite it as

\[
\forall f, y\; \exists x\; [\lnot (x = f(x)) \lor y = f(f(y))]
\]

We write the subexpressions as follows. To make the resulting formulas more
readable, we introduce symbols with names based on the subexpressions, rather
than the more generic $v_1, v_2, \ldots, v_n$:

    Subexpression   g_1: y    g_2: f(y)    g_3: f(f(y))    g_4: x    g_5: f(x)
    Symbol          y         fy           ffy             x         fx

For $C_A$ we then get

\[
(x = y \Rightarrow \mathit{fx} = \mathit{fy}) \land (x = \mathit{fy} \Rightarrow \mathit{fx} = \mathit{ffy}) \land (y = \mathit{fy} \Rightarrow \mathit{fy} = \mathit{ffy})
\]

For formula $C_B$ we get true, while for $\hat{\varphi}$ we get $\lnot (x = \mathit{fx}) \lor y = \mathit{ffy}$, and the
overall quantifier structure is:

\[
\forall y\; \forall \mathit{fy}\; \forall \mathit{ffy}\; \exists x\; \forall \mathit{fx}
\]

It can be easily shown that the QSL formula is valid. We omit the details.
Example 3. Our second example illustrates a case where the formula is valid,
but the soundness-preserving transformation fails to show this.

\[
\forall f\; [\forall x\; f(x) < f(x + 1)] \Rightarrow [\forall y\; f(y) < f(y + 2)] \qquad (12)
\]

To get this into the required form, we rewrite it as

\[
\forall f\; \forall y\; \exists x\; \lnot (f(x) < f(x + 1)) \lor f(y) < f(y + 2)
\]

We write the subexpressions as follows.

    Subexpression   g_1: y   g_2: f(y)   g_3: f(y+2)   g_4: x   g_5: f(x)   g_6: f(x+1)
    Symbol          y        fy          fy2           x        fx          fx1

For $C_A$ we then get

\[
(x = y \Rightarrow \mathit{fx} = \mathit{fy}) \land (x = y - 1 \Rightarrow \mathit{fx1} = \mathit{fy})
\land (x = y + 2 \Rightarrow \mathit{fx} = \mathit{fy2}) \land (x = y + 1 \Rightarrow \mathit{fx1} = \mathit{fy2})
\]

For formula $C_B$ we get true, while for $\hat{\varphi}$ we get

\[
\lnot (\mathit{fx} < \mathit{fx1}) \lor \mathit{fy} < \mathit{fy2}
\]

and the overall quantifier structure is:

\[
\forall y\; \forall \mathit{fy}\; \forall \mathit{fy2}\; \exists x\; \forall \mathit{fx}\; \forall \mathit{fx1}
\]

This formula is not valid.

This example shows the limited capability of our translation $T_s$. It does not
do the multiple instantiations of $x$ required to replace the quantified antecedent
in (12) with $f(y) < f(y + 1) \land f(y + 1) < f(y + 2)$.
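
The two translated formulas from Examples 2 and 3 can be explored by exhaustive evaluation over a small range of values. This is a sketch under stated assumptions, not the QSL decision procedure: the quantified integers are restricted to a hypothetical finite range, so a result over that range is only an illustration of the integer case, and the function names are introduced here. The body of each formula is taken as $C_A \Rightarrow \hat{\varphi}$, since $C_B$ is true in both examples.

```python
from itertools import product

def implies(a, b):
    return (not a) or b

def example2_holds(dom):
    """forall y, fy, ffy  exists x  forall fx : C_A => phi_hat (Example 2)."""
    def body(y, fy, ffy, x, fx):
        c_a = (implies(x == y, fx == fy)
               and implies(x == fy, fx == ffy)
               and implies(y == fy, fy == ffy))
        phi_hat = (x != fx) or (y == ffy)
        return implies(c_a, phi_hat)
    return all(
        any(all(body(y, fy, ffy, x, fx) for fx in dom) for x in dom)
        for y, fy, ffy in product(dom, repeat=3)
    )

def example3_holds(dom):
    """forall y, fy, fy2  exists x  forall fx, fx1 : C_A => phi_hat (Example 3)."""
    def body(y, fy, fy2, x, fx, fx1):
        c_a = (implies(x == y, fx == fy)
               and implies(x == y - 1, fx1 == fy)
               and implies(x == y + 2, fx == fy2)
               and implies(x == y + 1, fx1 == fy2))
        phi_hat = (not (fx < fx1)) or (fy < fy2)
        return implies(c_a, phi_hat)
    return all(
        any(all(body(y, fy, fy2, x, fx, fx1)
                for fx, fx1 in product(dom, repeat=2))
            for x in dom)
        for y, fy, fy2 in product(dom, repeat=3)
    )

print(example2_holds(range(3)))
print(example3_holds(range(-2, 3)))
```

In the first formula, choosing x equal to y always works. In the second, a falsifying choice is y = 0, fy = 0, fy2 = 0: whatever single x is picked, values of fx and fx1 exist that satisfy $C_A$ with fx < fx1.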

6 Results and Discussion 

We have implemented a prototype of the convergence testing framework within 
the UCLID [4] verification tool. Currently, we have only implemented the 
soundness-preserving translation to QSL. The QSL solvers use different tech- 
niques to transform a QSL formula to a quantified Boolean formula (QBF) [15]. 
All the experiments are performed on a 2GHz Pentium-4 running Linux, with 1 
GB of memory. 

In this section, we describe our experience with the convergence testing 
framework for a three-stage arithmetic pipeline given in figure 2. This example 
originated with the first work on symbolic model checking [7], and has subse- 
quently become a standard for verification research [10,13]. In our version, we 
make use of both stalling and forwarding to resolve read-after-write hazards in 
the pipeline. Previous versions used only forwarding, with the result that a new 
result is written to the register file on each step of operation. 
Fig. 2. Pipelined Version of ALU Circuit. The three stages of the pipeline: fetch,
execute and write-back. Read-after-write hazards are resolved for the first operand by
stalling and for the second by forwarding. The dashed lines indicate Boolean control
and the solid lines represent the flow of integer values.



The state elements of the pipeline include a function state variable, an
unbounded register file pRF. The integer state elements include the different
register identifiers, namely eSRC2, eDEST and wDEST, the data values eARG1,
eARG2 and wVAL, and the program counter pPC. The Boolean state elements
consist of the write enable registers eWRT and wWRT. The system functionality
is parameterized by uninterpreted function symbols for decoding an instruction,
updating the program counter and the ALU. The Boolean state elements are
initialized to false and the rest of the state elements take on arbitrary initial
values.

The pipeline was symbolically simulated starting from the initial state. The
QSL formula produced by the soundness-preserving translation was false after
k = 1 and k = 2 steps of simulation. A look at the Boolean state elements
indicated that the system indeed does not converge within two steps. However,
after k = 3 steps of simulation, the QSL formula produced was too large to be
solved with the current QSL solver implementation we use [15]. The formula had
53 quantified integer variables, with 6 levels of quantifier alternations, 836 nodes
in a Directed Acyclic Graph (DAG) representation of the formula, and the BDD
representing the QBF formula exceeded 1 GB of memory. However, we have been
able to prove the convergence of two simplified versions of the pipeline processor.

1. For the first case, we removed the data-path components of the processor
including the register file, operand values and the write-back value. The
remaining pipeline still contains the entire control complexity of the original
pipeline including the stalling and the forwarding mechanisms. This model
converges after k = 3 steps of simulation and our decision procedure detects
this within 2 seconds with less than 11 MB of memory. The QSL formula
contains 27 quantified integer variables, with 4 levels of quantifier alternations
and 249 nodes in the DAG form. Notice that this example contains
uninterpreted function symbols but does not contain any function state elements.











2. For the second case, we combined the execute and the write-back stages of 
the pipeline into a single stage (making the pipeline 2-stage), but retained 
the register file pRF and the data-path. The pipeline was modified to ac- 
commodate both stalling and forwarding of data. This example converges 
after k = 2 steps of simulation and our decision procedure takes 8 seconds 
to prove it valid. The memory consumption was about 80 MB. The QSL 
formula contains 29 quantified integer variables, with 4 levels of quantifier 
alternations and 203 nodes in the DAG form. 

We are currently working on alternate translations of QSL formulas to QBF 
formulas and hope to test the convergence of the pipeline with a few optimiza- 
tions. We are also experimenting with enumeration based QBF solvers including 
Quaffle [17]. 

Discussion. The notion of k-convergence is not useful for systems with
unbounded buffers, since many such systems do not converge. Moreover, our
preliminary results indicate that the convergence criterion we present is precise, but
computationally difficult to check. Abstraction techniques, such as predicate
abstraction [11], allow for greater efficiency at the expense of using an approximate
notion of convergence, and are a promising area for future work.
Abstract. Reduced Ordered Binary Decision Diagrams (ROBDDs) are
nowadays one of the most common dynamic data structures for Boolean
functions. Among the many areas of application are verification, model
checking, and computer aided design. In the last few years, SAT checkers,
based on the CNF representation of Boolean functions, have been getting more
and more attention as an alternative to the ROBDD based methods. We
show the difference between the CNF representation and the ROBDD
representation in one of the most degenerate cases: random monotone
2CNF formulas. We examine this model and give almost matching lower
and upper bounds for the ROBDD size in different cases, and show that
as soon as the formulas are non-trivial the ROBDD size becomes
exponential, thus showing perhaps one of the most fundamental advantages
of SAT solvers over ROBDDs.



1 Introduction 

Automatic manipulation of formulas in propositional logic is of major importance 
in both theoretical and practical computer science. In the VLSI and process 
analysis communities Reduced Ordered Binary Decision Diagrams (ROBDDs) 
are popular. Their usage, initiated by Bryant [B86], has caused a considerable 
increase of the scale of systems that can be verified. In the last few years SAT 
checkers have appeared as a very competitive alternative to the ROBDD based 
techniques, Clarke et al. [BCCF99] probably being the initiator of this trend. 

It is commonly said that ROBDDs and SAT complement each other,
i.e., there are cases where the ROBDD technique will work better, and those
where SAT will. Indeed, Groote and Zantema [GZ01] show that the ROBDD
proof of the pigeonhole principle takes exponential size ROBDDs while the unit
resolution proof is polynomial. In the other direction, they also give a family of
formulas, where an ROBDD based proof is polynomial, while already the CNF
representation is exponential. Ideally, for understanding the different faults and
merits of both techniques, we would like to have a characterization of the size
relation between the two representations of Boolean formulas: in CNF form, and
in ROBDD form. Hopefully, such an understanding will help in the construction
of a new data structure which will combine the good qualities of both ROBDDs
and SAT solvers.

There has been some previous work on the size of ROBDDs. Gröpl et al.
[GPS01], for example, investigate the largest possible size of an ROBDD over
all functions over n variables. Bollig and Wegener [BW00] examine the worst
case ROBDD size of a function with a given number of 1-inputs (among other
questions). Woelfel [W01] gives very tight bounds on the ROBDD size of the
integer multiplication function, which was one of the first examples of a
function with a polynomially sized circuit but an exponential size ROBDD, proved
originally by Bryant [B86].

In this paper we examine a very degenerate type of CNF formulas, monotone
2CNF formulas, consisting only of clauses with 2 variables, and no negation. We
consider random monotone 2CNF formulas with n variables where each of the $\binom{n}{2}$
possible clauses is chosen with probability p. These formulas are clearly always
satisfiable, and the (expected) number of satisfying assignments depends on p
(this number decreases as p increases). Moreover, the simple syntactic structure
of these formulas may lead one to believe that their ROBDD structure is succinct.
We show that this is far from being true.

In this work, we present a full characterization of the ROBDD size of random 
monotone 2CNF formulas. Namely, for practically every value of p, we study the 
ROBDD size of such random formulas and present matching (up to low order 
terms) lower and upper bounds on this size. Our results show that except for very 
small p, where the formula is degenerate, or very large p, where the formula has 
only a polynomial number of satisfying assignments, the most probable ROBDD 
size (under any ordering of the variables in the formula) is highly exponential, 
very closely related to the number of satisfying assignments to the formula. Thus 
we show that the ROBDD reductions are of little use when handling these simple 
CNF formulas. 

Let $\varphi_p$ be a random monotone 2CNF formula with n variables, in which
each of the $\binom{n}{2}$ possible clauses is chosen with probability p. Our results can be
(roughly) summarized as follows:

1. Let $p \le (1 - \epsilon)/n$, where $\epsilon > 0$ is constant. Notice that in this case a random
   formula $\varphi_p$ is expected to have fewer than n/2 clauses (implying that each
   variable is expected to appear at most once in $\varphi_p$). Then w.h.p. the ROBDD
   size of $\varphi_p$ is polynomial.

2. Let p satisfy (a) $(1 + \epsilon)/n \le p$ for some constant $\epsilon > 0$, and (b) for every
   constant $a > 0$, $p \le 1 - 1/n^{a}$ (i.e. p is not very small or large). Then w.h.p.
   the ROBDD size of $\varphi_p$ is super polynomial. Specifically, we show that for
   small values of p in the range defined above, the ROBDD size of $\varphi_p$ is in
   the range [...] and for large values of p, the ROBDD size of $\varphi_p$ is equal to
   [...] (w.h.p.). For example for $p = 1/\sqrt{n}$ the ROBDD size of $\varphi_p$ is roughly
   [...] and for $p = 1/2$ this size is roughly $2^{\log^2 n} = n^{\log n}$. Notice the sharp
   jump in the ROBDD size, with respect to case 1 above, with a very small
   increase of p.



3. If there exists some constant $a > 0$ such that $p \ge 1 - 1/n^{a}$ then w.h.p. the
   ROBDD size of $\varphi_p$ is again polynomial.
An important point in these bounds is that the upper bounds in items 2
and 3 above are derived by showing an upper bound on the number of satisfying
assignments to the formula. The fact that these bounds practically match
the lower bounds means that the ROBDD reductions are of very little use for
these kinds of formulas: we might as well have written a list of all satisfying
assignments as a description of the formula.

Along the way, we show that for small p, it is the pathwidth of the formula 
which determines the optimal ROBDD size. This parameter captures in a simple 
manner the concept of information flow that is caused by the variable ordering 
in the ROBDD method. In our restricted setting, this result can be seen as 
a matching lower bound to Berman’s [B89] classic upper bound on ROBDD 
size, relating circuit structure and ROBDD size using a notion similar to our 
pathwidth. Also, this result formalizes the common sense intuition of ROBDD
ordering, and thus shows one of the fundamental drawbacks of ROBDDs: if an
ordering does not put related variables close to one another, the ROBDD size
will be large.

The remainder of this paper is organized as follows. In Section 2 we present
the main definitions and notation that will be used throughout this work.
Specifically we show a natural characterization of random monotone 2CNF formulas $\varphi_p$
on n variables by the distribution $\mathcal{G}_{n,p}$ on graphs with n vertices. In Section 3
we show a connection between the ROBDD size of monotone 2CNF formulas
and certain combinatorial graph properties. We then define the pathwidth of a
formula, a notion which plays a major role in our analysis. Finally, in Section 4
we state the upper and lower bounds sketched above rigorously and proceed to
their proofs. Due to space limitations, some of our results appear without detailed
proof. A full version can be found at

http://www.wisdom.weizmann.ac.il/~verify/publications/2003/LPR03.html



2 Preliminaries and Notation 

2.1 Graphs 

For a graph G, denote its set of vertices by V, and its set of edges by E. Let
n be the size of V, and m be the size of E. We denote by d(G) the maximum
degree of a vertex in G. For a set of vertices $U \subseteq V$ define its set of neighbors
as $\Gamma_G(U) = \{v \mid v \notin U, \exists u \in U, (u, v) \in E\}$. Denote the subgraph induced
by a subset U of vertices as $G|_U$, i.e., $G|_U = (U, E \cap (U \times U))$. We say $U \subseteq V$ is
an independent set if the edge set of $G|_U$ is empty. Let ID(G) denote the set of
independent sets of the graph G. Denote the size of the largest independent set
in G by maxID(G). The definitions above imply that,

Proposition 1. $|\mathrm{ID}(G)| \le n^{\mathrm{maxID}(G)}$

Let $\mathcal{G}_V$ be the set of graphs on vertex set V. For short, we mark $\mathcal{G}_n = \mathcal{G}_{[1,n]}$.
2.2 Boolean Formulas

Let $A_V$ denote the set of Boolean assignments to the variable set V,
$A_V = \{\alpha \mid \alpha : V \to \{0, 1\}\}$. Let $\Phi_V = \{\varphi \mid \varphi \subseteq A_V\}$ denote the set of all Boolean
formulas on the variable set V ($\varphi$ is characterized by its set of satisfying
assignments). For $\alpha \in A_V$, $U \subseteq V$, denote by $\alpha|_U \in A_U$ the restriction of assignment
$\alpha$ to the set U. We would also like to consider the restriction of the formula $\varphi$
to a partial assignment. For $\varphi \in \Phi_V$, $U \subseteq V$, and some $\alpha \in A_U$, let

\[
\varphi|_\alpha = \{\, \beta \in A_{V \setminus U} \mid \exists \gamma \in \varphi,\ \gamma|_U = \alpha \text{ and } \gamma|_{V \setminus U} = \beta \,\}
\]

Again we will mark $\Phi_n = \Phi_{[1,n]}$ and $A_n = A_{[1,n]}$.



2.3 Random Monotone 2CNF Formulas 

In 2.2 we considered only the semantics of Boolean formulas by characterizing
them using their satisfying set of assignments. We now proceed to consider the
representation of a formula, its syntax. We consider a restricted class of CNF
formulas, monotone 2CNF formulas. A monotone 2CNF formula over variable
set V is the conjunction of a set of clauses of the form $(a \lor b)$ where a, b are in V.
We can equivalently model such a formula by a graph $G \in \mathcal{G}_V$, where each edge
(a, b) in the graph stands for the clause $(a \lor b)$. We then get that the formula
corresponding to the graph G is

\[
\varphi_G = \{\, \alpha \in A_V \mid \forall (i, j) \in E(G),\ \alpha(i) = 1 \text{ or } \alpha(j) = 1 \,\}
\]

We will consider such random formulas, using the random model $\mathcal{G}_{n,p}$, where
$G \in \mathcal{G}_{n,p}$ is a graph on vertices [1, n], where each possible edge is in the graph
with probability p, uniformly and independently. We will say an event in $\mathcal{G}_{n,p}$
happens with high probability if it happens with probability tending to 1 as n
approaches infinity.



2.4 ROBDDs – Reduced Ordered Binary Decision Diagrams

Definition 1. An OBDD on [1, n] is an edge-labeled directed graph, whose sinks
are labeled by Boolean constants FALSE and TRUE, and whose non-sink (or
inner) nodes are labeled by elements of [1, n]. Each inner node has two outgoing
edges, one labeled by 0 and the other by 1. An edge leading from an i-node must
end in a sink or a j-node, where j > i. Each inner node v with label k represents
a Boolean formula $\varphi_v \in \Phi_{[k,n]}$ defined in the following way. In order to check
if $\alpha \in \varphi_v$, $\alpha \in A_{[k,n]}$, start at v. After reaching an i-node, choose the outgoing
edge with label $\alpha(i)$, until a sink is reached. If the label of the sink is TRUE then
$\alpha \in \varphi_v$, if it is FALSE then $\alpha \notin \varphi_v$. The size of the OBDD is defined to be its
number of nodes.
Bryant [B86] has already shown that the minimal size OBDD for a formula
$\varphi \in \Phi_n$ is unique (up to isomorphism), and is called the ROBDD of $\varphi$. If we
add an additional requirement, that every edge leaving an i-node reaches a sink
or an (i + 1)-node, then we get a slightly different version of ROBDDs, called
Quasi-reduced OBDDs (QOBDDs). In this paper we will actually consider this
latter type, because of the following two lemmas (see [BW00] for example):

Lemma 1. The number of i-nodes, $1 \le i \le n$, of the QOBDD of $\varphi \in \Phi_n$ is
$|\{\varphi|_\alpha \mid \alpha \in A_{i-1}\}|$.

Lemma 2. If $s_R$ is the size of the ROBDD of $\varphi \in \Phi_n$, and $s_Q$ is the size of its
QOBDD, then $\frac{1}{n} s_Q \le s_R \le s_Q$.

The first Lemma allows us to deal with the size of QOBDD in a simple manner, 
and the second Lemma shows that the size of QOBDDs is practically the same 
as that of ROBDDs, especially since all size lower bounds we show will have an 
exponential nature. Therefore, for the remainder of the paper, we will examine 
only QOBDDs. For $\varphi \in \Phi_n$ we denote by BDD($\varphi$) the size of $\varphi$'s QOBDD.
For simplicity, we will not count the root node and the two leaf nodes of the
QOBDD when calculating BDD($\varphi$); this changes the QOBDD size by at most
3, and so is immaterial. We get the following proposition.

Proposition 2. For $\varphi \in \Phi_n$, $\mathrm{BDD}(\varphi) = \sum_{k=1}^{n-1} |\{\varphi|_\alpha \mid \alpha \in A_k\}|$
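
For very small n, the sum in Proposition 2 can be evaluated directly by enumerating assignments. The sketch below assumes a monotone 2CNF formula given as a list of 1-based variable pairs; the function names `satisfying_assignments` and `bdd_size` are introduced here for illustration, and the enumeration is exponential in n, so it is only meant for toy instances.

```python
from itertools import product

def satisfying_assignments(n, edges):
    """All 0/1 tuples over variables 1..n satisfying every clause (x_i or x_j)."""
    return {a for a in product((0, 1), repeat=n)
            if all(a[i - 1] == 1 or a[j - 1] == 1 for i, j in edges)}

def bdd_size(n, edges):
    """Sum over k = 1..n-1 of the number of distinct restrictions phi|alpha,
    with alpha ranging over all assignments to variables 1..k."""
    phi = satisfying_assignments(n, edges)
    total = 0
    for k in range(1, n):
        restrictions = set()
        for alpha in product((0, 1), repeat=k):
            # phi|alpha: the suffixes of satisfying assignments that extend alpha
            restrictions.add(frozenset(g[k:] for g in phi if g[:k] == alpha))
        total += len(restrictions)
    return total

path = [(1, 2), (2, 3)]                 # clauses (x1 or x2) and (x2 or x3)
print(len(satisfying_assignments(3, path)), bdd_size(3, path))
```

For the three-variable path the count is 2 at k = 1 and 3 at k = 2 (including the empty restriction), giving 5.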

We note the following useful upper bound on QOBDD size.

Proposition 3. For $\varphi \in \Phi_n$, $\mathrm{BDD}(\varphi) \le n(|\varphi| + 1)$.

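Proof. By Proposition 2,

\[
\mathrm{BDD}(\varphi) = \sum_{k=1}^{n-1} |\{\varphi|_\alpha \mid \alpha \in A_k\}|
\le \sum_{k=1}^{n-1} \left( |\{\alpha \in A_k \mid \varphi|_\alpha \ne \emptyset\}| + 1 \right)
\]

For every $\alpha \in A_k$ such that $\varphi|_\alpha \ne \emptyset$, there is at least one $\beta \in \varphi$ s.t. $\beta|_{[1,k]} = \alpha$.
Choose one of these $\beta$ and mark it by $\beta_\alpha$. Clearly if $\alpha_1 \ne \alpha_2$ then $\beta_{\alpha_1} \ne \beta_{\alpha_2}$, and
so $|\{\alpha \in A_k \mid \varphi|_\alpha \ne \emptyset\}| \le |\varphi|$ and we conclude,
$\mathrm{BDD}(\varphi) \le (n - 1)(|\varphi| + 1) \le n(|\varphi| + 1)$. □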

As is well known, the QOBDD of a formula $\varphi$ depends on the specific ordering
of variables in $\varphi$. Denote by $S_n$ the set of permutations on the set [1, n]. For a
formula $\varphi \in \Phi_n$ and a permutation $\sigma \in S_n$, denote

\[
\varphi^\sigma = \{\, \alpha \mid \exists \beta \in \varphi,\ \forall v \in V,\ \alpha(\sigma(v)) = \beta(v) \,\}
\]

$\varphi^\sigma$ is the result of changing the names of the variables of $\varphi$. This change may
result in a change of BDD($\varphi$), and in fact there are known examples (see for
example [CGP]), where BDD($\varphi$) is polynomial, while for some $\sigma$, BDD($\varphi^\sigma$) is
exponential. We therefore denote,

\[
\mathrm{mBDD}(\varphi) = \min_{\sigma \in S_n} \mathrm{BDD}(\varphi^\sigma)
\]

Clearly, Proposition 3 applies also to mBDD($\varphi$).
3 QOBDD Size vs. Combinatorial Graph Properties

Let G be a graph in $\mathcal{G}_n$. Let $\varphi = \varphi_G$ be the 2CNF formula corresponding to
G. In this section we show various connections between combinatorial properties
of G and the size of the QOBDD of $\varphi$. We will need the following definition. For
$\alpha \in A_n$ denote $Z_\alpha = \{v \in V \mid \alpha(v) = 0\}$.

Lemma 3. $\mathrm{ID}(G) = \{Z_\alpha \mid \alpha \in \varphi\}$

Proof. Let Z be an independent set in G. Consider the assignment $\alpha$ which
assigns a value of 0 to every vertex in Z and a value of 1 to the remaining
vertices in $V \setminus Z$. Clearly $Z = Z_\alpha$; furthermore, as Z is independent we conclude
that $\alpha \in \varphi$, implying that $Z \in \{Z_\alpha \mid \alpha \in \varphi\}$. For the other direction, consider
an assignment $\alpha \in \varphi$. By the definitions above, $Z_\alpha$ must be an independent set
in G. □



Corollary 1. For $\varphi = \varphi_G$ with $G \in \mathcal{G}_n$, $\mathrm{BDD}(\varphi) \le n(|\mathrm{ID}(G)| + 1)$.



Theorem 1. For $G \in \mathcal{G}_n$, setting

$$A_G = \{\Gamma \mid \Gamma = \Gamma_G(I) \cap [k+1,n],\ I \in \mathrm{ID}(G|_{[1,k]})\},$$

the size of the $k+1$ level in $\varphi_G$'s QOBDD (under natural ordering) is either
$|A_G|$ or $|A_G| + 1$.



Proof. Consider the set

$$A_\varphi = \{\varphi|_\alpha \mid \alpha \in A_k,\ \varphi|_\alpha \neq \emptyset\}.$$

The size of the $k+1$ level in $\varphi$'s QOBDD (under natural ordering) is exactly the
size of $A_\varphi$, possibly plus 1, if there is some $\alpha$ s.t. $\varphi|_\alpha = \emptyset$.
Hence, it suffices to present a one to one function from $A_\varphi$ to $A_G$ and vice
versa. For the first direction consider the function which associates with every
$\varphi|_\alpha$ the set $\Gamma_G(Z_\alpha) \cap [k+1,n]$ (where $Z_\alpha$ is as defined above).
As $\varphi|_\alpha \neq \emptyset$ we have that $Z_\alpha$ is an independent set in $G|_{[1,k]}$.
Now assume two formulas $\varphi|_{\alpha_1}$ and $\varphi|_{\alpha_2}$ that are not equal.
Namely (w.l.o.g.) there exists some assignment $\beta \in A_{[k+1,n]}$ such that
$\beta \in \varphi|_{\alpha_1}$ but $\beta \notin \varphi|_{\alpha_2}$. For $i = 1, 2$ let
$\gamma_i \in A_n$ be the assignment obtained by concatenating $\alpha_i$ and $\beta$.
By these definitions $\gamma_1 \in \varphi$ and $\gamma_2 \notin \varphi$. Hence, it must be
the case that $\gamma_2$ violates some clause, say the clause including the $i$'th and
$j$'th variables, where $i < j$ (that is $\gamma_2(i) = \gamma_2(j) = 0$).

Now (by contradiction) assume that $\Gamma_1 = \Gamma_G(Z_{\alpha_1}) \cap [k+1,n]$ is equal
to $\Gamma_2 = \Gamma_G(Z_{\alpha_2}) \cap [k+1,n]$. Recall that $\varphi$ is a monotone 2CNF
formula, it is satisfied by $\gamma_1 = \alpha_1\beta$, and it is not satisfied by
$\gamma_2 = \alpha_2\beta$. Moreover, $\varphi|_{\alpha_2}$ is not equal to $\emptyset$. By the
fact that $\varphi$ is satisfied by $\gamma_1$ we conclude that all variables in
$\Gamma_1 = \Gamma_2$ have value 1 under the assignment $\beta$, implying that they have
value 1 both in the assignment $\gamma_1$ and $\gamma_2$. Hence, it cannot be the case
that $i$ or $j$ belong to $\Gamma_2$. By the fact that $[1,k] \setminus Z_{\alpha_2}$ is set
to 1 in $\gamma_2$ it cannot be the case that $i$ or $j$ are in $[1,k] \setminus Z_{\alpha_2}$.
By the fact that $\varphi|_{\alpha_2} \neq \emptyset$ it cannot be the case that both
$i$ and $j$ are in $Z_{\alpha_2}$. We conclude that it must be the case that both $i$ and
$j$ are in $[k+1,n] \setminus \Gamma_2$. But the values of such $i$ and $j$ are determined
by $\beta$, and by the fact that $\gamma_1 = \alpha_1\beta \in \varphi$ we conclude that
either the value of $i$ or $j$ is 1 in $\gamma_2$.

For the other direction, consider the function which associates with each
$\Gamma \in A_G$ the assignment $\alpha \in A_k$ which is defined as follows. Let $Z$ be
some independent set in $G|_{[1,k]}$ such that $\Gamma_G(Z) \cap [k+1,n] = \Gamma$, and
define $\alpha(i)$ to be zero iff $i \in Z$. As $Z$ is an independent set in $G|_{[1,k]}$
it is the case that $\varphi|_\alpha \neq \emptyset$ and thus $\varphi|_\alpha$ is in $A_\varphi$.
Let $\Gamma_1 = \Gamma_G(Z_1) \cap [k+1,n]$ and $\Gamma_2 = \Gamma_G(Z_2) \cap [k+1,n]$ be
two different subsets in $A_G$. We will show that for corresponding $\alpha_1$ and
$\alpha_2$ as defined above the functions $\varphi|_{\alpha_1}$ and $\varphi|_{\alpha_2}$ differ.
Let (w.l.o.g.) $i$ be a vertex in $\Gamma_1 \setminus \Gamma_2$ (note that $i \in [k+1,n]$).
Let $\beta \in A_{[k+1,n]}$ be defined such that $\beta(i) = 0$ and $\beta(j) = 1$ for all
$j \neq i$. The vertex $i$ is connected by an edge to $Z_1$, implying that the
assignment $\gamma_1$ which is the concatenation of $\alpha_1$ and $\beta$ does not satisfy
$\varphi$. We conclude that $\beta \notin \varphi|_{\alpha_1}$. On the other hand, the vertex
$i$ is not connected to any vertices in $Z_2$, implying (in a similar manner) that
$\beta \in \varphi|_{\alpha_2}$. □
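
The statement of Theorem 1 can be sanity-checked numerically on small graphs. The sketch below is an illustration only, with hypothetical function names and 0-indexed vertices, so that the first $k$ vertices are `0..k-1` and the range $[k+1,n]$ corresponds to `k..n-1`. It counts the distinct restrictions at one level by brute force and, separately, the distinct neighbourhood traces of independent sets among the first $k$ vertices.

```python
from itertools import combinations, product


def level_size(n, edges, k):
    """Number of distinct restrictions phi|alpha over alpha in {0,1}^k."""
    phi = [a for a in product((0, 1), repeat=n)
           if all(a[i] or a[j] for i, j in edges)]
    return len({frozenset(b[k:] for b in phi if b[:k] == alpha)
                for alpha in product((0, 1), repeat=k)})


def neighbourhood_traces(n, edges, k):
    """Traces of Gamma(I) on the later vertices, I independent in the first k."""
    adj = {v: set() for v in range(n)}
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)
    traces = set()
    for r in range(k + 1):
        for subset in combinations(range(k), r):
            if any(v in adj[u] for u, v in combinations(subset, 2)):
                continue  # subset is not an independent set
            gamma = set().union(*(adj[v] for v in subset))
            traces.add(frozenset(v for v in gamma if v >= k))
    return traces


if __name__ == "__main__":
    cycle = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
    for k in range(1, 5):
        print(k, level_size(5, cycle, k), len(neighbourhood_traces(5, cycle, k)))
```

For each $k$ the two printed numbers should differ by 0 or 1, the extra node being the empty restriction that appears once the first $k$ vertices contain an edge.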

In the following, we define the notion of the pathwidth of a graph (as
introduced in [RS83]). Given an ordering of the vertices of a given graph $G$ the
pathwidth of $G$ is defined as follows:

Definition 2. For $G \in \mathcal{G}_n$, denote $\mathrm{PW}(G) = \max_{k \in [1,n]} |\Gamma_G([1,k])|$.

Next we present upper and lower bounds on the QOBDD size of $\varphi_G$ using
the pathwidth notion. Afterwards we show that the pathwidth of a graph is
monotone with respect to edge contractions and vertex and edge deletions. We
will use this property later on in Section 4.
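
As an illustration of Definition 2, the following Python sketch computes the pathwidth of a small graph under a given vertex ordering, and its minimum over all orderings by exhaustive search. It assumes, as the later proofs suggest, that $\Gamma_G(S)$ denotes the vertices outside $S$ that have a neighbour in $S$; the function names and the 0-indexed vertex labels are hypothetical.

```python
from itertools import permutations


def boundary(adj, placed):
    """Vertices outside `placed` that have a neighbour inside it."""
    return {u for v in placed for u in adj[v]} - placed


def pw(n, edges, order=None):
    """Largest boundary over all prefixes of `order` (natural order by default)."""
    order = list(range(n)) if order is None else list(order)
    adj = {v: set() for v in range(n)}
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)
    placed, worst = set(), 0
    for v in order:
        placed.add(v)
        worst = max(worst, len(boundary(adj, placed)))
    return worst


def min_pw(n, edges):
    """Minimum of pw over all orderings; feasible only for very small n."""
    return min(pw(n, edges, order) for order in permutations(range(n)))


if __name__ == "__main__":
    star = [(0, 1), (0, 2), (0, 3)]
    print(pw(4, star), min_pw(4, star))
```

The star example shows how much the ordering matters: placing the centre first exposes all three leaves at once, while placing it last keeps every boundary at a single vertex.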



3.1 Upper Bound 

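Lemma 4. $\mathrm{BDD}(\varphi_G) \le n(2^{\mathrm{PW}(G)} + 1)$

Proof. Using Theorem 1 we need to show that for every $k$ the set

$$\{\Gamma_G(I) \cap [k+1,n] \mid I \in \mathrm{ID}(G|_{[1,k]})\}$$

is of size at most $2^{\mathrm{PW}(G)}$. However, since $I \subseteq [1,k]$, the set
$\Gamma_G(I) \cap [k+1,n]$ is contained in $\Gamma_G([1,k])$, and $|\Gamma_G([1,k])| \le \mathrm{PW}(G)$.
Therefore the number of possible sets of the form $\Gamma_G(I) \cap [k+1,n]$ is at most
$2^{\mathrm{PW}(G)}$. □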



3.2 Lower Bound 

We first state without proof the following lemma, which is proved using a simple 
greedy strategy. 

Lemma 5. For $G \in \mathcal{G}_n$, $\mathrm{maxID}(G) \ge \frac{n}{d(G)+1}$.

Lemma 6. $\mathrm{BDD}(\varphi_G) \ge 2^{\mathrm{PW}(G)/(d(G)+1)^4}$.



Proof. Mark $h = \mathrm{PW}(G)$ and $d = d(G) + 1$. Set $k$ to be such that $|\Gamma_G([1,k])| = h$.
Using Theorem 1 we want to show that

$$|\{\Gamma \mid \Gamma = \Gamma_G(I) \cap [k+1,n],\ I \in \mathrm{ID}(G|_{[1,k]})\}| \ge 2^{h/d^4}. \qquad (1)$$

For every vertex $v \in [1,k]$ denote $A_v = \Gamma_G(\{v\}) \cap [k+1,n]$. We will find a
specific independent set $X$ of $G|_{[1,k]}$ such that

1. For every $v \in X$, $A_v \neq \emptyset$.

2. For every $u, v \in X$, $A_u \cap A_v = \emptyset$.

3. $|X| \ge h/d^4$.



Finding such an X will prove Equation (1), by letting / run over all subsets of 
X. 

Since |XG([l,fc])| = h, then | U Ay\ > h. Therefore there are at least ^ such 
sets Ay yf 4>. Noticing that each vertex w G [k + l,n] can appear in at most d 
sets Ay, and since \Ay\ < d, we have that each Ay intersects at most other 
such sets. By Lemma 5, there are at least 2 ' W ~ ^ 

intersect each other. Denote by iL C [1,/c] the set of v’s corresponding to these 
A„’s. Again, using Lemma 5, and by the fact that |id| > we can find a subset 
X oi H that is an independent set in G. This X satisfies all three properties 
above. □ 



3.3 Optimal Ordering 

The previous results we have shown all consider the natural ordering of variables
in $\varphi$. In the following we extend these results naturally to obtain the connections
needed between the properties of $G$ and the QOBDD size of an arbitrary ordering
of $\varphi$. Let $\sigma \in S_n$ and $G \in \mathcal{G}_n$. The graph $G^\sigma$ obtained after a
renaming of $V$ according to $\sigma$ is defined as

$$G^\sigma = (V, \{(\sigma(i), \sigma(j)) \mid (i,j) \in E(G)\}).$$

It is not hard to verify that $\varphi_{G^\sigma} = (\varphi_G)^\sigma$, implying that
$\mathrm{mBDD}(\varphi_G) = \min_\sigma \mathrm{BDD}(\varphi_{G^\sigma})$. We now define the minimal
pathwidth of a graph.

Definition 3. The minimal pathwidth of $G$ is $\mathrm{mPW}(G) = \min_\sigma \mathrm{PW}(G^\sigma)$.

It is straightforward to verify that Lemma 6 and Lemma 4 now imply: 

Theorem 2. $2^{\mathrm{mPW}(G)/(d(G)+1)^4} \le \mathrm{mBDD}(\varphi_G) \le n(2^{\mathrm{mPW}(G)} + 1)$

We believe this result to be of independent interest, since it shows the close 
connection between the pathwidth of the graph and the QOBDD size of the 
formula. If all orderings of the vertices result in many clauses being separated - 
the QOBDD size will be large, exponential in the pathwidth. 
3.4 Minors

For a graph $G \in \mathcal{G}_n$, and an edge $(i,j) \in E(G)$, the result of contracting
the edge $(i,j)$ in $G$ is the graph $G|_{V \setminus \{i\}}$ with the addition of the edges
$\{(j,x) \mid (i,x) \in E(G)\}$. We say $H$ is a minor of $G$ if it is the result of
consecutive edge contractions of $G$, vertex deletions and edge deletions of $G$. In our
application, $H$ does not have any multiple edges (i.e. $H$ is not a multigraph).

Lemma 7. If $H$ is a minor of $G$ then $\mathrm{mPW}(H) \le \mathrm{mPW}(G)$.

Proof. For one vertex or edge deletion the result is trivial. We therefore prove it
for one edge contraction and the lemma follows by induction. Let $G \in \mathcal{G}_n$, and
assume w.l.o.g. that $\mathrm{PW}(G) = \mathrm{mPW}(G)$. Assume an edge $(i,j)$ is contracted in
$G$ to give $H$, where $i < j$. We claim that the following ordering of $H$'s vertices
gives a pathwidth of $H$ which is at most $\mathrm{PW}(G)$: $1, 2, \ldots, i-1, i+1, \ldots, n$.

1. For all $k \le i-1$, $\Gamma_H([1,k]) = \Gamma_G([1,k]) \setminus \{i\}$.

2. For all $k \ge j$, $\Gamma_H([1,k] \setminus \{i\}) = \Gamma_G([1,k])$.

3. For all $i \le k < j$, $\Gamma_H([1,k] \setminus \{i\}) \subseteq (\Gamma_G([1,k] \setminus \{i\}) \setminus \{i\}) \cup \{j\} \subseteq \Gamma_G([1,k])$.

And so, for all $k$: $|\Gamma_H([1,k] \setminus \{i\})| \le |\Gamma_G([1,k])|$, to conclude. □
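
Lemma 7 can be checked exhaustively on small graphs. The sketch below is an illustration with hypothetical helper names: it contracts an edge as described above (vertex $i$ is removed and its remaining edges are moved to $j$), relabels the vertices to keep them contiguous, and compares the minimal pathwidth before and after. It assumes 0-indexed vertices and the same reading of $\Gamma_G$ as the set of outside neighbours of a prefix.

```python
from itertools import permutations


def min_pw(n, edges):
    """Minimal pathwidth by exhaustive search over all vertex orderings."""
    adj = {v: set() for v in range(n)}
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)
    best = n
    for order in permutations(range(n)):
        placed, worst = set(), 0
        for v in order:
            placed.add(v)
            outside = {u for w in placed for u in adj[w]} - placed
            worst = max(worst, len(outside))
        best = min(best, worst)
    return best


def contract(n, edges, i, j):
    """Contract edge (i, j): drop i, move its edges to j, relabel to 0..n-2."""
    def relabel(v):
        v = j if v == i else v
        return v - 1 if v > i else v

    merged = set()
    for a, b in edges:
        a, b = relabel(a), relabel(b)
        if a != b:  # the contracted edge itself becomes a loop and is dropped
            merged.add((min(a, b), max(a, b)))
    return n - 1, sorted(merged)


if __name__ == "__main__":
    prism = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5),
             (0, 3), (1, 4), (2, 5)]
    base = min_pw(6, prism)
    for i, j in prism:
        m, contracted = contract(6, prism, i, j)
        print((i, j), min_pw(m, contracted), "<=", base)
```

Storing edges as a set of sorted pairs discards parallel edges, which matches the remark that $H$ is not a multigraph.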

4 QOBDD Size of Random 2CNF 

We now proceed to examine the most probable QOBDD size of a random formula
in $\mathcal{G}_{n,p}$ for different values of $p$. Our analysis is divided into several cases,
each examining a different range of values for $pn$. The value $pn$ is (approximately)
twice the expected ratio between the number of clauses and the number
of variables in the formula, and is therefore a good indicator for the expected
structure and complexity of the formula. We prove the following results (with
high probability over the random formula $\varphi$).

1. For $pn < 1 - \varepsilon$, where $\varepsilon > 0$ is constant, $\mathrm{mBDD}(\varphi) = O(n \log n)$.
We will see that the probable formulas in this case are very degenerate, since the
graph will most probably contain only very small connected components.

2. For $1 + \varepsilon < pn < o(n)$, where $\varepsilon > 0$ is constant,

2^(p^°g”'”) < mBDD(p) < 2^(7 

This implies that the QOBDD size is highly exponential^ for small values of 
p, and slowly decreases as p approaches 1. For example, when pn = ^/ri, the 
QOBDD size is (which is still highly exponential). Notice the 

sharp jump in the QOBDD size, with respect to the previous case, with a 
very small increase of pn. 

^ For pn > 12 we show an improved lower bound of 2^^ 7 
3. We improve the bounds above for large values of $p$. Let $p$ satisfy (a) For every
constant $\varepsilon > 0$, $pn > n^{1-\varepsilon}$, and (b) For every constant $\alpha < 1$,
$pn < n - n^\alpha$ (i.e. $pn$ is large but not too large). Then

/Otf log^ n \ 

mBDD((/j) = 

In this case we get matching lower and upper bounds (up to constant factors 
in the exponent). Since pn < n — n“ for all a < 1, this means that mBDD((p) 
is super polynomial. For example, when p = ^, mBDD(i^) = ") = 

j^e(logn) 

4. If there exists a constant 0 < a < I s.t. pn > n— n“, then mBDD(<^) = 
i.e., is polynomial. 

An important point in these bounds is that all upper bounds (except the
one for $pn < 1 - \varepsilon$) are derived using Corollary 1, by showing an upper bound to
the number of satisfying assignments to the formula. The fact that these bounds 
practically match the lower bounds means that the QOBDD reductions are of 
very little use for these kinds of formulas - we might as well have written a list 
of all satisfying assignments as a description of the formula. 



4.1 Case 1: $pn < 1 - \varepsilon$

We start by stating the following theorem appearing in [JLR], which states that
w.h.p. $G$'s connected components are all of size at most $O(\log n)$ and are all
almost trees.



Theorem 3. ([JLR]): If $G \in \mathcal{G}_{n,p}$, where $pn < 1 - \varepsilon$ for some constant
$\varepsilon > 0$, then w.h.p. $G$'s connected components are of size $O(\log n)$, and are
either trees, or trees with one extra edge.



We now show that the QOBDD size of a graph that is a tree is small. This
is done by showing that the pathwidth of a tree is small. Combining these two
facts we will conclude that w.h.p. $\mathrm{mBDD}(\varphi) \le O(n \log n)$.

Lemma 8. For $T \in \mathcal{G}_n$, where $T$ is a tree, $\mathrm{mPW}(T) \le \log_2 n$

Proof. If $n = 1$ then clearly $\mathrm{mPW}(T) = 0 = \log_2(1)$. We order the vertices of
the tree recursively. Number the $s$ subtrees rooted at the children of the root
vertex $r$ according to their size, i.e., $T_1$ is the largest, $T_2$ the second, and so on
until $T_s$, the smallest subtree. Order each of the subtrees recursively: the vertices
of $T_1$ are ordered $t^1_1, t^1_2, \ldots, t^1_{k_1}$, the vertices of $T_2$ are ordered
$t^2_1, t^2_2, \ldots, t^2_{k_2}$, and so on. Now order all the vertices in the following way:

$$t^1_1, t^1_2, \ldots, t^1_{k_1},\ t^2_1, t^2_2, \ldots, t^2_{k_2},\ \ldots,\ t^s_1, t^s_2, \ldots, t^s_{k_s},\ r.$$

We claim that this ordering gives a pathwidth of at most $\log_2 n$.

1. For $k \in [1, k_1 - 1]$, $\Gamma_T(\{t^1_1, \ldots, t^1_k\}) = \Gamma_{T_1}(\{t^1_1, \ldots, t^1_k\})$.
By the induction hypothesis this set is of size at most $\log_2 |T_1| < \log_2 n$.

2. For $k = k_1$, $|\Gamma_T(\{t^1_1, \ldots, t^1_{k_1}\})| = |\{r\}| = 1 \le \log_2 n$, since $n$ is at least 2.

3. For $1 < i \le s$ and $k \in [1, k_i]$,
$\Gamma_T(\{t^1_1, \ldots, t^1_{k_1}, \ldots, t^i_1, \ldots, t^i_k\}) = \{r\} \cup \Gamma_{T_i}(\{t^i_1, \ldots, t^i_k\})$.
By the induction hypothesis we get that this set is of size at most $\log_2 |T_i| + 1$.
However, since $i > 1$, $T_i$ is not the largest subtree child of $r$, and therefore must
satisfy $|T_i| \le \frac{1}{2}|T|$, which gives $\log_2 |T_i| + 1 \le \log_2 n$.

□
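
The recursive ordering in the proof of Lemma 8 is easy to try out. The Python sketch below is an illustration under stated assumptions: it places the subtrees of each vertex in decreasing order of size and the vertex itself last, which is one reading of the ordering described above, and it measures the largest set of outside neighbours over all prefixes. The random tree generator and all function names are hypothetical.

```python
import math
import random


def random_tree(n, rng):
    """Children lists of a random rooted tree on 0..n-1 with root 0."""
    children = {v: [] for v in range(n)}
    for v in range(1, n):
        children[rng.randrange(v)].append(v)
    return children


def subtree_order(children, root=0):
    """Order vertices subtree by subtree, largest subtree first, root last."""
    size = {}

    def measure(v):
        size[v] = 1 + sum(measure(c) for c in children[v])
        return size[v]

    def visit(v, out):
        for c in sorted(children[v], key=lambda c: -size[c]):
            visit(c, out)
        out.append(v)
        return out

    measure(root)
    return visit(root, [])


def max_boundary(children, order):
    """Largest number of unplaced vertices adjacent to a prefix of `order`."""
    adj = {v: set(cs) for v, cs in children.items()}
    for v, cs in children.items():
        for c in cs:
            adj[c].add(v)
    placed, worst = set(), 0
    for v in order:
        placed.add(v)
        worst = max(worst, len({u for w in placed for u in adj[w]} - placed))
    return worst


if __name__ == "__main__":
    rng = random.Random(0)
    for n in (1, 2, 8, 32):
        tree = random_tree(n, rng)
        print(n, max_boundary(tree, subtree_order(tree)), math.log2(n))
```

Each printed boundary should stay at or below $\log_2 n$, in line with the lemma.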

Theorem 4. If $G \in \mathcal{G}_{n,p}$ where $pn < 1 - \varepsilon$ for some constant $\varepsilon > 0$,
then w.h.p. $\mathrm{mBDD}(\varphi_G) = O(n \log n)$.

Proof. According to Theorem 3, w.h.p. $G$'s connected components $C_1, \ldots, C_k$ are
all of size at most $O(\log n)$ and are each a tree with maybe an addition of one
edge. Since an extra edge can increase the pathwidth of a graph by at most 1,
by Lemma 8 we have that for all $i$, $\mathrm{mPW}(G|_{C_i}) \le \log_2 |C_i| + 1$. Therefore,
by Theorem 2 we have
$\mathrm{mBDD}(\varphi_{G|_{C_i}}) \le |C_i| \cdot (2^{\log_2 |C_i| + 1} + 1) \le 3|C_i|^2$. It is not
hard to verify that this implies

$$\mathrm{mBDD}(\varphi_G) \le n + \sum_{i=1}^{k} \mathrm{mBDD}(\varphi_{G|_{C_i}}) \le n + 3 \sum_i |C_i|^2.$$

Denoting $M = \max_i |C_i|$, we have that $\mathrm{mBDD}(\varphi_G) \le n + 3nM$, and since for
all $i$, $|C_i| = O(\log n)$, $\mathrm{mBDD}(\varphi_G) = O(n \log n)$. □

4.2 Lower Bound of Case 2: $1 + \varepsilon < pn = o(n)$

We start by showing that for pn > 12 w.h.p. mPW(G) > |n. We also show that 
for pn = 0(1), w.h.p. d{G) = O(logn), and now using Theorem 2 we get an 
exponential lower bound for mBDD(i^) in the case 12 < pn = 0(1). From this 
we easily derive a lower bound for larger pn, while pn = o(n). 

The result for 1 + e < pn <12 now follows by finding a minor H of G, that 
has a large pathwidth. We show that G contains a minor H which is actually 
an element of Gi,p', where Ip' > 12, and since mPW(G) > mPW(iL), we get an 
exponential (in 1) lower bound for mBDD(ip). Details follow. 

Lemma 9. For G G Gn,p, where pn > 12, w.h.p., mPW(G) > |n. 

Proof. We show that if pn > 12, then w.h.p., for G G Gn,p, every set V C V{G), 
where \V\ = |n, satisfies |r'G(h^)| > in. This will prove the lemma. 

For fixed A,BCV, where | A| = |n and \B\ = ^n, 

Pr [rdA) CBj = (l-p)I^K"-(l^l+|s|)) = (l-p)5"i" < 

If we have that for all relevant A and B, Pg{A) 2 B then the graph is as we 
want it. We bound the probability of this not happening using a simple union 
bound: 

2 " • 2 " • e~^ = 

This tends to zero if pn > 12. □ 
It is not hard to verify that w.h.p. the maximal degree $d(G)$ of a graph
$G \in \mathcal{G}_{n,p}$ with $pn = O(1)$ is of size $O(\log n)$. We thus conclude, by Theorem 2
that

Corollary 2. For G € Gn,p where 12 < pn = 0(1), w.h.p., mBDD(i^G) > 

We now turn to study values of p that satisfy 12 < pn = o(n). 

Theorem 5. For G £ Gn,p, where 12 < pn = o{n), w.h.p., mBDD(<pG) > 

2^{p ") 

Proof. Set k = ^, and examine the random behavior of which is actually 

an element of Gk,p- Since pn = o(n), p = o(l) and therefore k is unbounded, so 
by Corollary 2, w.h.p. mBDD(ipG| ^ i/p) . Since ^ < n, 
we get i log"^ i i log""‘ n. 

A simple observation is that ii FI = G\^, then mBDD(ip( 3 ) > mBDD((/?//), 
and this gives us the desired result. □ 

It is left to show our bounds for 1 + e < pn < 12. To do so we show that for 
G G Gn,pi pn- > 1 + e, G contains a minor FI that behaves as a random graph in 
Gk,p', where p'k > 12. This, combined with the analysis above will prove that FI 
has large pathwidth. 

Theorem 6. ([JLR]): If $G \in \mathcal{G}_{n,p}$ and $pn > 1 + \varepsilon$, for some constant $\varepsilon > 0$,
then there is some constant $\theta$ s.t. w.h.p. the biggest connected component of $G$
is of size at least $\theta n$.

Theorem 7. For G G Gn.pt where 1 + e < pn < 12 and e > 0 is constant, w.h.p. 
mBDD(pG) > 

Proof. For two reals 0 < Pi,P 2 , < 1, s.t., pi + (1 — Pi)p 2 = P, we can view G as 
the union of two graphs, Gi and G 2 , where Gi G Gn,pi, and G 2 G Gn,p2- Setting 
Pi = ^(1 + f ), we get that | < np 2 < 12. 

In the following, we find a minor Hi of Gi which will contain no edges at all, 
and then consider how the edges of G 2 appear in H\. This gives us a minor H 
of G which will have a large pathwidth. 

By Theorem 6, we have that Gi contains a tree of size On. As before we may 
assume that the maximum degree in this tree is d = O(logn) (this will happen 
w.h.p.). It is not hard to verify that this implies that for any k, G\ contains 
I = ^ disjoint connected sets Vi, . . .Vi, each of size k (such a partition can be 
obtained by traversing the tree mentioned above). Now set k = = O(logn), 

notice that I is unbounded. In the following we assume that both k and I are 
integers, otherwise we must use the [-J notation. 

Define a minor $H_1$ of $G_1$, by contracting all of the edges internal to each $V_i$,
and removing all vertices outside of $\bigcup_i V_i$, and all edges not internal to the $V_i$'s
- in other words, $H_1$ contains $l$ vertices, and no edges. Define a minor $H$ of $G$,
by considering the edges of $G_2$ as they appear in $H_1$. An edge of $H$ corresponds
to $k^2$ (possible) edges of $G_2$, and so will appear with probability $p_3$,



P3 =P2(1 + (1 -P 2 ) + • ■ • + (1 -P 2 )) > P 2 k {I- P 2 ) 



\k^-l 



> P2k - 
e 



Now, 



On 2 1 

e - ^ = 



According to Lemma 9, w.h.p. mPW(iL) > |/ = ^ ), and by Lemma 7, 

mPW(G) > mPW(iL) > „ )■ Lastly, w.h.p. d(G) = O(logn), and then by 

Theorem 2 we have that w.h.p. mBDD((^G) > 2 viT^T/, to conclude. □ 



4.3 Lower Bound of Case 3: $n^{1-\varepsilon} < pn < n - n^\alpha$

Notice that the lower bound presented in the previous Section 4.2 is not super
polynomial if $p$ is taken to be very large (namely for values of $p$ greater than
$1/\log^5 n$). In the following section, we study large values of $p$ and obtain super
polynomial lower bounds. To show a lower bound in these cases, we will work
directly with Theorem 1 and not with the pathwidth of the graph. To get a lower
bound using this theorem we need to first estimate the number of independent
sets in a random graph of $\mathcal{G}_{n,p}$.

For the remainder of this section, we will assume (a) For every constant $\varepsilon > 0$,
$pn > n^{1-\varepsilon}$, and (b) For every constant $\alpha < 1$, $pn < n - n^\alpha$.



Independent Sets in $\mathcal{G}_{n,p}$. Recall that Theorem 1 shows a connection between
certain combinatorial properties of $G$ and the QOBDD size of $\varphi_G$. In particular,
a necessary condition for a large $\mathrm{mBDD}(\varphi_G)$ is the existence of a large (super
polynomial) number of independent sets in $G$. We start by showing this condition
holds w.h.p. on random graphs in $\mathcal{G}_{n,p}$, and then use it for proving the lower
bound of case 3.

Denote q = 1 — p. We will consider the number of independent sets of size 
k = kc in Gn,p, where k = and therefore Since q > for 

every constant a < 1 , we get that k is unbounded, and we can therefore assume 
A: is a natural number. We take c to be a small constant. Since pn > n^~'^ for 
every constant e > 0, we have k = 0(n*^ log n) for every constant e > 0. Let 
7 > 0 be an arbitrarily small constant, in the following we will use the fact that 
k <n~^ . 

Denote the expected number of independent sets of size kc hy E = Ec- 
Clearly, E = Q)q^^^ ■ It is not hard to verify that E = n^i^i given c is small 
enough. Furthermore, it can be seen (using standard techniques) that the vari- 
ance V of the number of independent sets of size k is at most \E“^. Thus, by 
Chebyshev’s inequality. 
Corollary 3. For small enough c, the number of independent sets of size k in
G € Qn,p is with probability greater than 

The constant | bound on the probability obtained in Corollary 3 will not suffice 
for our purpose, and we will therefore amplify the probability of this result. 
Roughly speaking, this is done by applying Corollary 3 on a large class of almost 
disjoint subsets of vertices in G (namely subsets that share at most a single 
vertex) where each subset is of polynomial size. If one of these subsets has many 
independent sets, so does G. Due to space limitations, the full proof is omitted.

Lemma 10. For small enough c, the number of independent sets of size k in 
G € Gn,p is with probability greater than 1 — 2“” 



QOBDD Size Lower Bound. We will now use Theorem 1 to prove the lower 
bound of case 3 on the QOBDD size of G G Gn,p- It is not hard to verify that it 
suffices to prove 



Lemma 11. Let $G \in \mathcal{G}_{n,p}$. Let $k = k_c$ be as defined in Section 4.3. For small
enough c, w.h.p. mBDD(<^G) = 



Proof. By Theorem I it is enough to show that w.h.p., for every set U C [l,n], 
\U\ = y/fi, 

|{CG(/)n([l,n]\t/) I JgID(G|j,)}| >n^«. 

Since this will show, that for every ordering of the vertices of G, the size of the 
^/n + 1 row in (pc's QOBDD is at least We will therefore show that for 

every such U this happens with probability greater than 1 — ^ (J^) ’ 

using the union bound, we get that it is true for all U w.h.p. 

Let Ui and C/2 be two independent sets of size k in G|^. For i = 1,2, let 
Fi = /^(C/i) n ([1, n] \ C/). The probability that a specific vertex is in Ti but not 
F 2 is greater than pq^ , and therefore the probability that there is no such vertex 
in [l,n] \ C/, i.e., Fi = F 2 , is at most. 






< e 



< e 



.,3/4 



where 7 > 0 is an arbitrarily small constant. Since the number of independent 
sets Ui in U is at most \U\^ < then the probability that all 

the sets Fc{Ui) O ([l,n] \ U) differ is at least 

1 - > 1 - 



3 /4 

For a specific U, by Lemma 10, with probability at least 1 — 2“" , the number 

of independent sets of size k in U, is To conclude, 

1 - + 2-”'^") > 1 - > 1 - - f ” ) 

n \^nj 

□ 
4.4 Upper Bounds of Cases 2, 3, and 4

We now prove the upper bound of case 4. The upper bounds of cases 2 and 3 
are proven similarly (their proof involves setting the parameter k in the proof 
below to 4 

Theorem 8. Let $G \in \mathcal{G}_{n,p}$, where $pn > n - n^\alpha$ for some constant $0 < \alpha < 1$.
Then, w.h.p. mBDD((/?G) = 

Proof. The expectation of the number of independent sets of size k = \ + 1 
is at most, 

Q(1_p)( 2) = Q(n“-i)(2) =n3'=(2+(«-i)(fc-D), 

Since (a— l)(fc— 1) = (a— 1) |" < —3, the expectation is at most = o(l), 
and so by Markov’s inequality w.h.p. maxID(G) < k. By Proposition 1 and 
Corollary 1, mBDD(<^G) <n-n^ = □ 
Abstract. Symbolic reachability analysis based on Binary Decision
Diagrams (BDDs) is a technique that allows the implementation of
efficient state space exploration algorithms. However, in practice it is well
known that the BDD blowup problem limits the size of the systems that 
can be analyzed. Conversely, simulation is a low-cost state generation 
technique, although its effectiveness is limited due to its inherent se- 
quentiality. We present a hybrid methodology that combines simulation 
and symbolic traversal in order to improve the state space exploration of 
large systems. The methodology concentrates on asynchronous concur- 
rent systems, whose peculiarities are not fully exploited by other existing 
techniques for hybrid verification. Our approach exploits the information 
obtained from simulations to improve the knowledge of the state space, 
effectively guiding symbolic traversal. We demonstrate the applicabil- 
ity of this methodology in the verification of complex control-dominated 
asynchronous circuits. 



1 Introduction 

State space computation is the main bottleneck for most formal verification tech- 
niques. As an example, for invariant verification all reachable states of the system 
are calculated and the desired invariants are checked to hold in all of them. If the 
system fails to satisfy the invariants, it is necessary to identify a counter-example 
that reproduces the sequence of actions that the system performs before failing. 
The computational complexity of invariant verification is revealed when systems 
that exhibit high degrees of concurrency with irregular state spaces are analyzed 
(the well-known state explosion problem). In those cases, even the utilization 
of BDD-based symbolic techniques [1,2] does not allow the complete analysis of 
the state space. 

In recent years mixed approaches combining simulation and formal verification
have been introduced, coining the term hybrid verification [3]. Instead
of ensuring the complete exploration of the state space, hybrid verification
intends to provide efficient mechanisms to identify significantly large portions of
the state space with a reduced computational complexity. Hybrid verification
has been traditionally useful when the size of the system under analysis is too
large to be fully verified by conventional means. In these cases, hybrid verification
provides the designer with positive feedback to improve the reliability of
the system in terms of failures discovered in a first step of the verification flow.
These techniques may also help in the early stages of the design, when failures
are not real design errors but holes in the specifications.

Fig. 1. Two-step scheme: simulation followed by guided-traversal.

This paper presents a hybrid reachability strategy tailored for asynchronous
concurrent systems, i.e. systems modeled by the interleaved execution of concurrent
events. We propose a two-step mechanism based on a combination of simulation
and reachability analysis (see Figure 1). In a first step, simulation provides
an initial depth-first view of the states in the system. In order to guarantee a
good coverage of the state space, simulation detects those states where the system
chooses between alternative execution sequences (i.e. branching sequences).
Then, each one of the possible alternative sequences is further explored.
The analysis introduced in this work guarantees that interleaving branches due
to concurrency are not exhaustively explored during simulation. Instead,
only one of the sequences is explored, resembling the techniques used in partial
order reduction methods [4,5].

In a second step symbolic traversal is applied to improve the state cover- 
age. The information about the ordering in which events are fired, obtained by 
simulation, is used to guide the way in which traversal is applied. Reachability 
analysis is performed for each one of the sequences generated by the simulation 
phase, accumulating the obtained states. 

The remainder of the paper is organized as follows. Section 2 introduces exist- 
ing previous research related to our methodology. Section 3 provides background 
on the model used for asynchronous concurrent systems, and on the peculiari- 
ties of their reachability analysis. The proposed simulation scheme is described 
in Section 4. The analysis of the dynamic behavior of the system and its appli- 
cation to guided traversal is described in Section 5. Experimental results on the 
application to invariant checking on control-dominated asynchronous circuits are 
analyzed in Section 6. Section 7 concludes the paper. 
2 Previous Work

State space exploration using guided techniques has become a subject of wide
interest. These techniques tackle the guidance of reachability analysis toward
failure detection rather than toward complete state space computation. Guided search
typically uses “score-boarding” to find sequences from the initial states to failure 
states. Various metrics have been proposed to prioritize the state exploration 
based on the Hamming distance [6], tracking [6], reachability probability [7], 
lighthouses or guide-posts [8,6,9], and rarity search [10]. 

Several techniques have been introduced to guide the search toward uncov- 
ered regions of the state space. Ganai et al. [8] introduced a combination of 
adaptive simulation with retrograde analysis. Adaptive simulation is based on 
random simulation with a backtracking mechanism to avoid getting stuck dur- 
ing the search. Retrograde analysis involves a combination of forward analysis 
with pre-images from the failure states. Bloem et al. [11] use hints to guide the 
symbolic search and to alleviate the BDD explosion problem. Each hint indicates 
which portion of the transition relation should be used at each step to avoid a 
BDD blowup. Ganai et al. [8] and Yang et al. [6] suggest the manual insertion
of guide-posts. User-defined guide-posts are variables inserted in the system
which, if activated during the traversal, indicate that the search is on the right
track to find a failure. In [9] an automatic guide-post insertion mechanism is proposed.
Kuehlmann et al. [7] suggest using the state reachability probability as a guide 
for state prioritizing. Again, Ganai et al. [10] propose a rarity-based guide that 
tracks latch toggle activity to improve state coverage. 

Some authors suggest the combination of symbolic reachability analysis with 
BDD-subsetting. In [12] when the BDD representing the state space grows be- 
yond a certain limit, a subset is taken such that the BDD size is reduced but a 
large fraction of the state space is kept. [3] attempts to improve the subsetting 
mechanism by differentiating control and data-path, and keeping subsets that 
preserve all possible control behaviors. 

The work presented in this paper resembles some of the strategies used by
partial order reduction techniques [4,5]. However, some key aspects differentiate
our approach from these techniques. First of all, the goal of the approach is to
generate the largest possible portion of the state space. This goal is radically
opposite to partial order reduction. Second, the state successors to be explored are
selected taking into account exclusively the causality relations between events in
the system. No assumption is made on the type of temporal property being
verified. Additionally, the reduced state space is never rebuilt; only finite sequences
of states are generated.




Efficient Hybrid Reachability Analysis for Asynchronous Concurrent Systems 







Fig. 2. An example of transition system. 



3 Background 

3.1 Transition Systems 

A TS is a formalism oriented to modeling asynchronous concurrent systems that
emphasizes the execution of abstract events rather than how they are encoded.
Events may have different semantics depending on the level of detail of the model
(signal changes, protocol operations, etc). The concurrent execution of events is
described by means of interleaving, i.e. weaving the execution into sequences.

Formally, a transition system (TS) [13] is composed of a non-empty set of
states $S$, a non-empty alphabet of events $\Sigma$, a transition relation
$T \subseteq S \times \Sigma \times S$, and a set of initial states $S_{in} \subseteq S$.
Transitions are denoted by $s \xrightarrow{e} s'$. The firing region
of an event $e$ is defined as $Fr : \Sigma \to 2^S$ such that
$Fr(e) = \{s \in S \mid \exists\, s \xrightarrow{e} s' \in T\}$. Thus, event $e$ is firable at
state $s$ if $\exists\, s \xrightarrow{e} s' \in T$, i.e. $s \in Fr(e)$. The
set of events firable at state $s$ is denoted by $\Sigma(s)$. A run of a TS is a firing
sequence $\sigma = s_1 \xrightarrow{e_1} s_2 \xrightarrow{e_2} \cdots$, such that $s_1 \in S_{in}$ and
$\forall i \ge 1 : s_i \xrightarrow{e_i} s_{i+1} \in T$.
Given the significance of individual events, the transition relation (TR) of a TS
can be naturally partitioned into a disjoint set of relations, one for each event
$e \in \Sigma$: $T_e = \{s \xrightarrow{e} s' \in T \mid s, s' \in S\}$.
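
The definitions above translate directly into an explicit-state data structure. The following Python sketch is only an illustration: it stores the transition relation as a collection of `(source, event, target)` triples and derives the firing region of an event, the events firable at a state, and the per-event partition of the relation. The function names and the five-state example are hypothetical and do not reproduce the system of Figure 2.

```python
from collections import defaultdict


def firing_region(transitions, event):
    """Fr(e): the states from which `event` can fire."""
    return {s for (s, e, _) in transitions if e == event}


def firable_events(transitions, state):
    """The events that can fire at `state`."""
    return {e for (s, e, _) in transitions if s == state}


def partition_by_event(transitions):
    """Split the transition relation into one relation T_e per event."""
    parts = defaultdict(set)
    for s, e, t in transitions:
        parts[e].add((s, t))
    return dict(parts)


# Hypothetical example: after a, the events b and c are concurrent.
EXAMPLE = [
    ("s1", "a", "s2"),
    ("s2", "b", "s3"), ("s2", "c", "s4"),
    ("s3", "c", "s5"), ("s4", "b", "s5"),
]

if __name__ == "__main__":
    print(firing_region(EXAMPLE, "b"))
    print(firable_events(EXAMPLE, "s2"))
    print(partition_by_event(EXAMPLE))
```

Every transition carries exactly one event, so the per-event relations are pairwise disjoint and their sizes add up to the size of the whole relation.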

Figure 2 shows a TS that will be used as a running example. The system
contains 22 states and a set of events $\Sigma = \{a, b, c, d, e, f, g\}$. State $s_1$ is its
initial state. Note the existence of multiple interleaving sequences due to concurrency,
e.g. $a \to b \to c$ and $a \to c \to b$.

3.2 Reachability Analysis 

Fig. 3. General example for the chaining technique.

The set of states that is reachable in any number of steps from a set of states $C$
($Reach(T, C)$) is defined as the least fix-point of the following recurrence:

$$S_0 = C, \qquad S_{i+1} = S_i \cup Img(T, S_i), \qquad (1)$$

where $Img(T, S_i)$ is the one-step image computation applying the TR $T$ on a set
of states $S_i$. When $C$ equals $S_{in}$ for a given TS, this algorithm generates the
state space of a system in a Breadth First Search (BFS) style. The number of
of steps from the initial state to the first occurrence of each of the reachable 
states (called the sequential depth of the TS). In the example of Figure 2, the 
application of BFS from the initial state gives state S 2 in a first step, states 
S 3 ,S 5 ,Si 3 in a second step, states S 4 , Sg, se, Sio, S 14 , Siy, S 15 in a third, etc. 
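
The fix-point recurrence can be written down directly for an explicit-state representation. The Python sketch below is a minimal illustration and not the symbolic (BDD-based) procedure the paper is concerned with: states are plain strings, the transition relation is a list of `(source, event, target)` triples, and the function names and the small example system are hypothetical.

```python
def image(transitions, states):
    """One-step image: successors of `states` under any event."""
    return {t for (s, _, t) in transitions if s in states}


def reach_bfs(transitions, initial):
    """Least fix-point of S_{i+1} = S_i | Img(T, S_i), with the step count."""
    reached = set(initial)
    steps = 0
    while True:
        frontier = image(transitions, reached) - reached
        if not frontier:
            return reached, steps
        reached |= frontier
        steps += 1


# Hypothetical example: after a, the events b and c are concurrent.
EXAMPLE = [
    ("s1", "a", "s2"),
    ("s2", "b", "s3"), ("s2", "c", "s4"),
    ("s3", "c", "s5"), ("s4", "b", "s5"),
]

if __name__ == "__main__":
    print(reach_bfs(EXAMPLE, {"s1"}))
```

On this example the traversal discovers `s2`, then `s3` and `s4`, then `s5`, so the step count equals the sequential depth of the system.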

The classical BFS algorithm can be improved based on two key observations.
First, at each iteration of the BFS traversal, most transitions described in the
monolithic TR are not applied (e.g. at the second BFS step, only events b, c, g
are significant). Second, the TR of a TS can be naturally partitioned into
disjunctive TRs, one for each event, that can be applied individually.

These observations have suggested alternative traversal algorithms, named 
chaining [14,15]. Chaining applies the individual TRs of events in a predeter- 
mined order such that the number of new states generated at each step is max- 
imized. After the application of the transition relation of an event, the newly 
generated states are immediately used as domain for the next event in order, 
hence coining the term chaining. 

Figure 3 shows the general concept for two TRs A and B. If A and B are
applied to the same set FROM in a BFS style, a certain number of states is
reached (see Figure 3(a) and (b)). However, chaining would apply A to FROM
and generate a new set of states (FROM + TO(A) in Figure 3(c)), and afterward
apply B to this set (in Figure 3(d)). The number of reached states increases with
almost the same computational effort.

In practice, chaining can significantly reduce the number of iterations of
the BFS algorithm [15,16]. The method is especially effective if an appropriate
firing order of the events is selected. Chaining outperforms BFS techniques in the
verification of asynchronous concurrent systems because states are computed at a
much faster rate and with less effort, thus reducing the number of TR applications.
Moreover, partitioning the TR provides important memory savings and CPU
speed-ups when implemented over BDD structures.
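
The difference between BFS and chaining can be illustrated with a small explicit-set sketch. It assumes a hypothetical partitioned TR (`PARTS`, one relation per event) and counts productive passes over the events; it is not the BDD-based implementation of [14,15], only a toy comparison.

```python
def image(rel, states):
    """Successors of `states` under a single-event relation of (source, target) pairs."""
    return {t for (s, t) in rel if s in states}


def reach_bfs(parts, start):
    """BFS style: every event relation is applied to the same FROM set."""
    reached, iterations = set(start), 0
    while True:
        new_states = set()
        for rel in parts.values():
            new_states |= image(rel, reached)
        if new_states <= reached:
            return reached, iterations
        reached |= new_states
        iterations += 1


def reach_chained(parts, order, start):
    """Chaining: states produced by one event feed the next event immediately."""
    reached, iterations = set(start), 0
    while True:
        size_before = len(reached)
        for event in order:
            reached |= image(parts[event], reached)
        if len(reached) == size_before:
            return reached, iterations
        iterations += 1


# Hypothetical partitioned TR: a, then concurrent b and c.
PARTS = {
    "a": {(1, 2)},
    "b": {(2, 3), (4, 5)},
    "c": {(2, 4), (3, 5)},
}

print(reach_bfs(PARTS, {1}))                        # three productive iterations
print(reach_chained(PARTS, ["a", "b", "c"], {1}))   # one productive pass
print(reach_chained(PARTS, ["c", "b", "a"], {1}))   # a poor order needs two passes
```

The last call shows why the firing order matters: the same relations applied in an unfavourable order need an extra pass.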



4 Simulating Transition Systems 

This section presents a simulation approach for asynchronous concurrent systems
that automatically provides a good state space coverage. At each explored state
the causality between firable events is analyzed to identify firing conflicts between
them. Conflict detection makes it possible to identify execution sequences that
exclude each other in a variety of ways, for example through mutual exclusion. This
simulation scheme resembles the state exploration techniques used by partial
order reduction [4,5].

Simulation chooses a particular firing order among all possible interleaved 
executions of concurrent events. Thus, simulation alone has limited coverage 
effectiveness for concurrent systems. We will show in Section 5 that the inter- 
leaving of events due to concurrency can be explored more efficiently by symbolic 
traversal once the information from a particular simulation sequence is available. 



4.1 Conflict Detection to Improve Coverage 

Conflict detection is the key mechanism that makes it possible to distinguish
between sequences of events representing alternative behaviors of a system and
interleaving sequences of concurrent events. Sequences of the first type are relevant
and must be explored in order to guarantee a good coverage of all possible behaviors
of the system. Exploring interleaved sequences must be avoided and postponed to
the symbolic traversal phase.

An event $e_1$ disables another event $e_2$ if a pair of states $s_1, s_2$ exists such
that $s_1 \xrightarrow{e_1} s_2 \in T$ and $e_2$ is firable in $s_1$ ($e_2 \in \Sigma(s_1)$) but $e_2$ is not
firable in $s_2$ ($e_2 \notin \Sigma(s_2)$). Two events $e_1, e_2$ are in conflict if $e_1$ disables
$e_2$ or $e_2$ disables $e_1$. A conflict is called symmetric if $e_1$ disables $e_2$ and $e_2$
disables $e_1$. The conflict is called asymmetric if $e_1$ disables $e_2$ but $e_2$ does not
disable $e_1$, or vice versa.
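
The definitions above translate almost directly into code. The sketch below works on an explicit list of labelled transitions rather than on BDDs; the two toy systems (`CHOICE_TS`, `DISABLING_TS`) and the function names are assumptions made for illustration only.

```python
def firable(trans, state):
    """Events firable at `state`: labels of its outgoing transitions."""
    return {e for (p, e, _q) in trans if p == state}


def disables(trans, e1, e2):
    """e1 disables e2 if some s1 -e1-> s2 has e2 firable in s1 but not in s2."""
    return any(
        e == e1 and e2 in firable(trans, p) and e2 not in firable(trans, q)
        for (p, e, q) in trans
    )


def conflict_kind(trans, e1, e2):
    """Classify a pair of distinct events as symmetric, asymmetric or none."""
    d12, d21 = disables(trans, e1, e2), disables(trans, e2, e1)
    if d12 and d21:
        return "symmetric"
    if d12 or d21:
        return "asymmetric"
    return "none"


# Hypothetical TS: x and y exclude each other, z is concurrent to both.
CHOICE_TS = [
    (0, "x", 1), (0, "y", 2), (0, "z", 3),
    (1, "z", 4), (3, "x", 4),
    (2, "z", 5), (3, "y", 5),
]
# Hypothetical TS: v disables u, but u does not disable v.
DISABLING_TS = [(0, "u", 1), (0, "v", 2), (1, "v", 3)]

print(conflict_kind(CHOICE_TS, "x", "y"))
print(conflict_kind(CHOICE_TS, "x", "z"))
print(conflict_kind(DISABLING_TS, "u", "v"))
```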

Figure 4 depicts a portion of the state space of a concurrent system. The
figure illustrates the conflict situations previously described. From the initial
state, (a) shows three events $e_1, e_2, e_3$ that are mutually concurrent; (b) shows
a symmetric conflict between $e_1$ and $e_2$; and (c) shows an asymmetric conflict
in which $e_2$ disables $e_1$ but not the contrary (event $e_3$ remains concurrent to $e_1$
and $e_2$).

Fig. 4. Concurrent and conflict situations.



A state in which two or more events are in conflict is called a branching state,
in which alternative execution sequences exist (see Figure 1). Each separate
sequence can be followed, resulting in different behaviors of the system.

Symmetric conflicts are associated with states in which the system takes a
decision. The behavior in each branch may involve completely different sets of
events and thus produce distinct/disjoint sets of states. A different simulation
sequence is generated for each branch in order to achieve a better coverage of the
state space. In contrast, asymmetric conflicts can be associated with disablings,
in which the firing of one event (the disabler) prevents the firing of a second event
(the disabled). In this type of conflict two different firing sequences exist. In one
of the sequences, both events can fire concurrently and no disabling occurs. In
the other sequence, the disabler event fires, thus disabling the second event and,
in consequence, disabling also some part of the system behavior. As an example,
disablings can be associated with races in digital circuits (e.g. producing either
glitches or deadlocks at the output of some gates).

4.2 Simulation Algorithm 

This section presents an improved simulation mechanism based on the analysis
of the conflicts found along the simulation sequences. Every time a pair of
conflicting events is identified, a new sequence is generated and stored in a list of
pending sequences. The sequence duplication scheme is detailed in Figure 5. In
the example there is a firing sequence ($\sigma_1$) in which three events $e_1, e_2, e_3$ are
firable in state $s_i$. Let us assume that events $e_1$ and $e_2$ are in symmetric conflict.
A copy of the branch state $s_i$ is generated ($s_i'$), together with a copy ($\sigma_1'$) of the
sequence (up to state $s_i$) being explored. The exploration continues by removing
the disabled event ($e_2$) from the list of firable events at $s_i$. Then, event $e_1$ is
fired in the active sequence $\sigma_1$, generating a state where only $e_3$ remains firable.
On the other hand, $\sigma_1'$ is stored for later exploration. The disabled event ($e_1$) is
removed from the list of firable events at state $s_i'$. Note that the order in which
events have been selected is not necessarily the order in which our algorithm
may proceed. In that case, concurrent events like $e_3$ may be given priority.

Simulation sequences are processed following a configurable priority scheme.
States can be analyzed following a DFS or BFS style, or a mixture of both.
However, other parameters can be taken into account, e.g. the number of choices
already taken. The firing order of the events can also be decided according to
some priority scheme. In our simulator, we keep track of the number of times
that each event is fired. To avoid locking the state exploration in some local
region of the state space, we give additional priority to those events which have
been fired less often.

Fig. 5. Branching sequences due to conflicts ($e_1$ and $e_2$ in symmetric conflict).

The algorithm in Figure 6 describes the suggested simulation scheme. The
simulation engine stores the set of sequences, together with the events that are
ready to fire, in the active list. Sequences are stored as linked lists of BDD
cubes, each cube representing a state of the system. Sequences terminated due
to state repetition, deadlocks or simulation limits are stored in seq. All states
visited along the simulation are stored in visit. Each sequence analyzed along
the simulation is stored as a tuple $\tau = (s, \sigma, E, D, B)$ that consists of: $s$, the last
state in the sequence; $\sigma$, the firing sequence required to reach $s$ from $s_{in}$; a set
$E \subseteq \Sigma$ that indicates the events that remain firable at $s$; and two integers to
indicate the firing depth $D$ with respect to $s_{in}$, and the number of taken choices
$B$ required to reach that depth.

Without loss of generality we will assume that the simulation starts from a
single initial state $s_{in}$. A tuple is created for this state by using the empty
sequence $\{s_{in}\}$ and all possible firable events (retrieved by function firable($s_{in}$)).
This initial sequence is placed into the list of active sequences pending
processing.

The simulator takes one sequence from the list of pending sequences. The last
state of the sequence is checked to determine if the simulation should proceed
from it. If the state has already been visited, or the depth/branch limit
has been surpassed, the sequence is stored in seq. If the last state can be
processed, a firable event $e \in \tau.E$ is selected. Events can be selected giving
priority to either: events that are not in conflict, events that are in symmetric
conflict, or events that are in asymmetric conflict. If event $e$ is in conflict,
then sequence $\tau$ is duplicated into an exact copy $\tau'$. Event $e$ is marked as
non-firable in $\tau'$ to avoid exploring the same sequence multiple times (see Figure 5).
Finally, $\tau'$ is inserted back into the list of active sequences for a later exploration
of alternative branches.

    visit := seq := active := ∅;
    τ := alloc_sequence(s_in, {s_in}, firable(s_in), 0, 0);
    active := active ∪ τ;
    while (active ≠ ∅) do
        τ := get_priorized_sequence(active);
        visit := visit ∪ τ.s;
        if (termination_condition(τ)) then
            seq := seq ∪ τ.σ;
            free_sequence(τ);
            continue;
        e := select_firable_event(τ.E);
        if (event_disables(…)) then
            τ' := duplicate_sequence(τ);
            τ'.E := τ'.E \ e;
            τ'.B := τ'.B + 1;
            active := active ∪ τ';
        τ.s := Img(T_e, τ.s);
        τ.D := τ.D + 1;
        τ.E := firable(τ.s);
        τ.σ := τ.σ →e s;
        active := active ∪ τ;


Fig. 6. Pseudocode of the simulation algorithm. 



The selected event $e$ is fired from state $\tau.s$, generating its successor
$Img(T_e, \tau.s)$ that is updated in the sequence. The remaining information is also
updated, including the extension of the firing sequence by $\sigma \xrightarrow{e} s$. Finally, $\tau$ is
updated and placed back into the list of active sequences.
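
The scheme can be approximated by the following explicit-state sketch. It is not the authors' BDD-based simulator: it assumes deterministic events, uses a plain LIFO list as the priority scheme, fires events in alphabetical order, checks conflicts only locally at the current state, and terminates a sequence as soon as a fired transition lands on an already visited state. All names and the toy system `CHOICE_TS` are hypothetical.

```python
def firable(trans, state):
    """Sorted list of events firable at `state`."""
    return sorted({e for (p, e, _q) in trans if p == state})


def succ(trans, state, event):
    """Successor of `state` by `event` (assumes deterministic events)."""
    return next(q for (p, e, q) in trans if p == state and e == event)


def in_conflict(trans, state, event, candidates):
    """True if `event` and another candidate disable each other at `state`."""
    after_event = firable(trans, succ(trans, state, event))
    for other in candidates:
        if other == event:
            continue
        after_other = firable(trans, succ(trans, state, other))
        if other not in after_event or event not in after_other:
            return True
    return False


def simulate(trans, s_in, max_depth=50):
    """Return (visited states, terminated sequences) for tuples (s, sigma, E, D, B)."""
    visit, seqs = {s_in}, []
    active = [(s_in, (s_in,), firable(trans, s_in), 0, 0)]
    while active:
        s, sigma, E, D, B = active.pop()           # LIFO: DFS-style priority
        if not E or D >= max_depth:                # deadlock or depth bound
            seqs.append(sigma)
            continue
        e = E[0]                                   # alphabetical firing order
        if in_conflict(trans, s, e, E):
            rest = [o for o in E if o != e]        # copy with e made non-firable
            active.append((s, sigma, rest, D, B + 1))
        t = succ(trans, s, e)
        sigma = sigma + (e, t)
        if t in visit:                             # state repetition
            seqs.append(sigma)
            continue
        visit.add(t)
        active.append((t, sigma, firable(trans, t), D + 1, B))
    return visit, seqs


# Hypothetical TS: x and y in symmetric conflict, z concurrent to both.
CHOICE_TS = [
    (0, "x", 1), (0, "y", 2), (0, "z", 3),
    (1, "z", 4), (3, "x", 4),
    (2, "z", 5), (3, "y", 5),
]
visited, sequences = simulate(CHOICE_TS, 0)
print(sorted(visited), sequences)
```

On the toy system one sequence is produced per conflict branch, while state 3 (reachable only through an interleaving of the concurrent event z) is left for the later traversal phase.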

Given the example of Figure 2, the simulator generates two firing sequences
(assuming an alphabetical firing order of the events) shown in Figures 7(a) and
8(a), respectively. Two sequences are generated because a conflict is detected at
state $s_2$ between events b and g, when b is selected to fire. A third sequence is
also generated, although not shown due to lack of space, because events c and
g are also in conflict at $s_2$. Note that sequences are annotated with the events
firable at each state.

This simulation scheme allows a fast in-depth analysis of the system, provid- 
ing a good state coverage since all conflict branches are identified. A number of 
heuristic termination conditions are included to avoid repeating equivalent exe- 
cution sequences: stop exploring a sequence whenever an already visited state is 
reached, and bound the depth of the sequences to a factor of the total number 
of events in the TS. 

5 Guided Traversal

This section shows how the sequences generated by the simulation phase,
together with the information about which events are firable at each state, allow
analyzing the causality relations between events. Such causality is later exploited
to improve the symbolic traversal by using chaining. An efficient traversal
algorithm is applied for selected sequences, thus improving the state coverage.
Following the ideas in [16], the TR of the system is partitioned, and the application
of each part is scheduled by analyzing the causality relations between the events
in the sequence, thus maximizing the state generation rate.

5.1 Extracting Causality Relations 

Causal event structures (CES) describe all possible sequential and concurrent
executions of a set of events. A CES [17] is a tuple $\langle \Sigma, \prec \rangle$ where
$\Sigma = \{e_1, \ldots, e_n\}$ is a finite set of events and $\prec\ \subseteq \Sigma \times \Sigma$ is a strict partial
order (irreflexive and transitive) over $\Sigma$ called the causality relation.

Given a CES the following relations can be defined, where $co$ is called the
concurrency relation:

$$\begin{aligned}
id &= \{(e, e) \mid e \in \Sigma\} \\
\succ &= \{(e_1, e_2) \mid (e_2, e_1) \in\ \prec\} \\
co &= \Sigma \times \Sigma - \{\prec \cup \succ \cup\ id\}
\end{aligned}$$

Provided a firing sequence of events, we can recover a partial order showing
the causality relationships among those events, i.e. a CES. A partial order $\prec$ over
a set of events $\Sigma$ and the associated relations $\succ$, $id$ and $co$ completely
partition $\Sigma \times \Sigma$. We can use this fact to derive a CES from a sequence $\sigma$, such
that: $id$ is obviously defined; $e_1\ co\ e_2$ if $e_1$ and $e_2$ are firable simultaneously
in some state visited by $\sigma$; and $e_1 \prec e_2$ if $e_1$ precedes $e_2$ and they are not firable
simultaneously in $\sigma$ (similarly for $\succ$).
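
The derivation rule can be sketched as follows. The input format (a list of pairs "events firable at the state, event actually fired") and the names `derive_ces` and `STEPS` are assumptions for this example; the sketch applies the rule literally, restricts the relations to fired events (disabled events do not belong to the CES), and assumes each event occurs once in the sequence.

```python
from itertools import combinations


def derive_ces(steps):
    """Derive (prec, co) from one firing sequence.

    `steps` is a list of (firable_events, fired_event) pairs in firing order.
    """
    events = [fired for (_enabled, fired) in steps]
    co = set()
    for enabled, _fired in steps:
        # Fired events that are firable in the same state are concurrent.
        for a, b in combinations(sorted(set(enabled) & set(events)), 2):
            co.add((a, b))
            co.add((b, a))
    # e1 precedes e2 in the sequence and they are never firable together.
    prec = {
        (a, b)
        for i, a in enumerate(events)
        for b in events[i + 1:]
        if (a, b) not in co
    }
    return prec, co


# Hypothetical sequence: a, then concurrent b and c, then d.
STEPS = [({"a"}, "a"), ({"b", "c"}, "b"), ({"c"}, "c"), ({"d"}, "d")]
prec, co = derive_ces(STEPS)
print(sorted(prec), sorted(co))
```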

Figures 7(b) and 8(b) show the CESs derived from the sequences in
Figures 7(a) and 8(a), respectively. Arcs denote causality relations between events.
For the sake of clarity, we also indicate with dotted arcs the conflict relations,
although they are not part of the actual CES. Thus, if event b disables g, a dotted
arc from b to g is drawn. In Fig. 7, note that event g appears two times. The
first time it is disabled by b, while the second time it is fired after g.

5.2 Reachability Analysis 

A topological order of the events of a CES is a sequence $e_1 \cdots e_n \in \Sigma^*$ ($n = |\Sigma|$),
such that all $e_i$ are distinct and $\forall\, 1 \le i, j \le n : e_i \prec e_j \Rightarrow i < j$. Firing the
events following such a topological order often guarantees that when an event is
fired all its causal predecessors have already been fired. Given an event $e_i$ ready to
fire, if all the events concurrent to $e_i$ are fired before $e_i$, most states in $Fr(e_i)$ will
already be reached. A traversal algorithm in which events are fired following the
topological order achieves good effectiveness. Unfortunately, the causality
relations of a complex system cannot be described with a single CES. A pair of
events may be causally ordered in some part of the state space, whereas they
may be concurrent in another. In contrast, causality relations derived from
a single firing sequence provide a quite precise approximation of the behavior at
localized areas of the state space. Hence, traversal can successfully exploit that
information in those cases.

Fig. 8. Sequence 2: (a) Firing sequence, (b) CES capturing causality, and (c) states
reached after guided traversal.

Given a set of sequences generated from the described simulation process, we
propose the following three-step traversal strategy (see Figure 9): Generate the
CES for each firing sequence (2). Find a topological order of the events in the
CES (3). Execute a symbolic traversal algorithm from the initial state by applying
the TR of each event in sequence (4-5). Events will be applied following the
topological order extracted from the CES. The states generated after the image
computation of one event will be immediately applied as domain for the image
computation of the successor event in order, thus chaining the effect.

    1. S_σ := S_in;
    2. CS := build_CES(σ);
    3. Torder := find_topological_order(CS);
    4. foreach e_i ∈ Torder do
    5.     S_σ := S_σ ∪ Img(T_{e_i}, S_σ);

Fig. 9. Guided traversal algorithm.
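
A minimal explicit-set sketch of the guided traversal follows. It assumes a causality relation `PREC` given as a set of pairs, a hypothetical partitioned TR `PARTS`, and a simple Kahn-style ordering with alphabetical tie-breaking; none of these names come from the paper, and sets stand in for BDDs.

```python
def topological_order(events, prec):
    """Order events so that (p, e) in prec implies p comes before e."""
    remaining, order = set(events), []
    while remaining:
        ready = sorted(
            e for e in remaining
            if not any((p, e) in prec for p in remaining)
        )
        if not ready:
            raise ValueError("causality relation is cyclic")
        order.append(ready[0])
        remaining.remove(ready[0])
    return order


def image(rel, states):
    """Successors of `states` under a single-event relation."""
    return {t for (s, t) in rel if s in states}


def guided_traversal(parts, order, start):
    """Apply each event's TR once, chaining the newly reached states."""
    reached = set(start)
    for event in order:
        reached |= image(parts[event], reached)
    return reached


# Hypothetical data: a, then concurrent b and c, then d.
PREC = {("a", "b"), ("a", "c"), ("a", "d"), ("b", "d"), ("c", "d")}
PARTS = {
    "a": {(1, 2)},
    "b": {(2, 3), (4, 5)},
    "c": {(2, 4), (3, 5)},
    "d": {(5, 6)},
}

order = topological_order(PARTS.keys(), PREC)
reached = guided_traversal(PARTS, order, {1})
print(order, sorted(reached))
```

In this toy case a single pass over the four events reaches all six states, including state 4, which lies on an interleaving that a single simulated sequence a, b, c, d would not visit.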

Note that, in practice, not all firing sequences need to be considered for
traversal. Some sequences will be almost equivalent to others, with only a small
suffix of the simulation being different. In those cases, causality analysis and
traversal should only be applied to the suffix. Otherwise large numbers of states
will be repeatedly generated from different sequences.

Figures 7(b) and 8(b) show the CES annotated with an index that indicates
the position in the topological order selected for traversal. Observe that the
disabled events are not annotated since they do not actually belong to the CES.
Figures 7(c) and 8(c) show the portions of the state space in the original TS
(see Figure 2) generated by the guided traversal for each firing sequence. After
both traversals only state $s_{10}$ remains unreached. Note that in Figure 8, event
d appears two times along the sequence. In that case, duplicated events are
renamed to satisfy the topological order requirements.

5.3 Methodology Implementation 

Figure 10 sketches an implementation of the proposed strategy for hybrid
exploration of the state space. The process is divided into two parts: a simulation
phase followed by a traversal phase.

Simulation phase: From the initial state, the simulation engine generates 
multiple branching sequences. The number of sequences will be determined by 
the set of conflicts found during the state exploration, or limited by a user-defined 
limiting parameter. Causality is extracted and attached to each sequence to be 
used later in the traversal phase. 

Traversal phase: Sequences are iteratively taken to apply symbolic traversal
on them. Heuristically, we choose the sequence that contains the most states not
covered by previous sequences. Note that if all states in a sequence are already
contained in the set of states reached so far, the sequence is discarded for
traversal. The events in the selected sequence are fired following a topological
order. The order is either extracted from the causality information attached to
the sequence, or directly taken from the order in which events are fired along
the sequence (also a valid topological order). Events are iterated once if applied
from the causality information, or until a fix-point is reached if applied from the
simulation order.

Fig. 10. Structure of the hybrid traversal methodology (simulation phase followed
by traversal phase).

6 Experimental Results 

In the following tables several asynchronous concurrent systems are analyzed
using the hybrid reachability scheme described in this paper. A brief description
of each system follows:

PCC          Pausible clock controller for heterogeneous systems in [18].
GALS         Globally-Asynchronous Locally-Synchronous design in [19].
RGD-arbiter  asP*, RGD arbiter in [20] described at transistor level.
IPCMOS       A pulse-based controller for asynchronous pipelines in [21].
STARI        A self-timed pipeline in [22].

All these results are from executions on a 2 GHz Pentium IV Linux computer
with 512 MB of memory. Note that the behavior of all these systems is
delay-dependent. In our experiments we only concentrate on the untimed state space.

Table 1 compares the results of full reachability analysis when using different
traversal strategies on the selected benchmarks. Suffix C is used for circuits and
A for abstractions. The number in parentheses indicates the number of stages in
the case of pipelines. We provide results for BFS traversal (BFS), chained traversal
using a greedy ordering strategy (C Greedy) [14], and a token-traverse chained
strategy (C Token) [16]. Our goal when presenting these experiments is to
demonstrate the significant impact that the chaining methodology has on the
efficiency of traversal. Moreover, we will use these results as a reference to
evaluate the proposed hybrid methodology.

Table 1. Experimental results: various forms of traversal.

               |             |          BFS           |        C Greedy        |        C Token
Name           | States      | Iter | BDD    | CPU    | Iter | BDD    | CPU    | Iter | BDD    | CPU
GALS-C         | 1.232e+3    |   68 |  10498 |   42.7 |   17 |  10914 |    0.2 |   10 |  10767 |    0.2
PCC-C          | 9.89184e+5  |   64 |  80979 |   42.4 |   15 |  18104 |    2.6 |    5 |  12573 |    2.7
RGD-arbiter-A  | 3.33813e+9  |   79 | 218088 |  695.7 |   20 | 113757 |   22.6 |    5 |  13938 |    6.1
RGD-arbiter-C  | 5.46918e+13 |      |        |   Tout |   27 | 823820 | 1469.5 |   16 |  44238 |   46.0
IPCMOS-C (4 c) | 8.15635e+9  |      |        |   Tout |   30 |  80479 |   51.6 |   10 | 121994 |   44.1
IPCMOS-C (6 c) | 1.78657e+14 |      |        |   Tout |   41 | 126707 |   41.3 |   13 | 207124 |   19.1
IPCMOS-A (4 c) | 1.16785e+7  |  129 | 201380 |   96.9 |   12 | 153160 |   28.0 |   11 | 223037 |   48.4
IPCMOS-A (6 c) | 9.15592e+9  |  237 | 209198 | 1055.1 |   16 |  54978 |   22.1 |    8 | 188061 |   27.3
STARI-C (8 c)  | 1.07225e+12 |      |        |   Tout |   56 | 170544 |  105.5 |   11 | 219575 |   73.0

Table 2. Experimental results: simulation followed by guided-traversal.

               |          Simulation           |               Traversal
Name           | Seq | BDD    | States | CPU   | Seq | BDD    | States      | CPU
GALS-C         |  27 |  13485 |    381 |   0.5 |   1 |  16208 | 1.232e+3    |   0.8
PCC-C          |   1 |   9120 |    306 |   0.5 |   1 |  21185 | 9.89184e+5  |   3.7
RGD-arbiter-A  |  17 |  10493 |    142 |   0.5 |   1 |  33355 | 1.05433e+9  |   2.7
RGD-arbiter-C  |  30 |  17480 |    221 |   1.2 |   1 | 148711 | 9.18829e+12 |  17.4
IPCMOS-C (4 c) |   1 |   8088 |    179 |   0.3 |   1 |  99799 | 8.05928e+9  |  21.6
IPCMOS-C (6 c) |   1 |  15191 |    263 |   0.6 |   1 | 278575 | 1.75992e+14 |  14.9
IPCMOS-A (4 c) |   1 |  13727 |    133 |   0.3 |   1 | 151493 | 1.16785e+7  |  25.6
IPCMOS-A (6 c) |   1 |  28481 |    241 |   0.9 |   1 | 179577 | 9.15592e+9  |  32.9
STARI-C (8 c)  |   8 | 141299 |   5646 |  16.9 |   2 | 283725 | 9.73548e+11 | 126.0


The first column in Table 1 shows the total number of states (States). The 
second set of columns shows the number of iterations (Iter) of BFS traversal, the 
peak BDD size (BDD) and the computation time (CPU) (in seconds). The third 
set of columns shows the same parameters but for the chained traversal with 
greedy ordering. The last set of columns shows the same parameters but for the 
token traverse chained strategy. Note that in both modified traversals Iter refers 
to the iterations of the algorithm, not to the sequential depth of the experiment. 

Table 2 shows the results of the hybrid traversal strategy. The first set of 
columns provides data to evaluate the simulation phase. Column Seq indicates 
the total number of sequences that have been explored; column BDD shows the 
peak BDD size; column States shows the total number of states visited along 
the simulation; finally CPU indicates the computation time in seconds. It is 
important to note that the ratio of visited states versus CPU time is small 
compared to standard simulations. The reason is that, at each state, conflict 
relations should be analyzed penalizing the simulation efficiency. However, this 
initial effort should pay off later in the traversal phase. 

The second set of columns provides data to evaluate the traversal phase. 
Column Seq indicates the subset of sequences that have been traversed; column 
BDD shows the BDD peak size during traversal. Column States shows the states 
reached after guided-traversal; and column CPU indicates the computation time 
in seconds for this phase. Note that some BFS results not shown are due to a 
CPU time-out set to 1 hour. 

The initial set of experiments, BFS traversal versus chained traversal,
highlights the importance of a good chained strategy. The impact on both the number
of iterations and the peak BDD size allows reducing the computation times for
traversal.

These preliminary experiments show that significant portions of the state 
space can be reached by our hybrid approach in reduced CPU times. BDD 
sizes remain reasonable for all examples, as expected, due to the spatial locality 
obtained by the guided-traversal step. In addition, the portion of the state space 
generated by the approach is a good starting point to execute symbolic traversal 
until the full state space is reached. 

In the future we intend to improve the strategies used to select which events
must be fired first during simulation. These strategies should influence to a great
extent the coverage of the state space achieved during simulation, and later on
during guided-traversal. We also want to explore in more detail how the firing
order for events influences the BDD sizes and the state coverage.

7 Conclusions 

We believe that the incremental analysis of the state space of a system by
techniques that exploit state locality is the key to the success of traversal algorithms.
Instead, most existing approaches try to exploit the locality available in the
transition relations to minimize them, rather than considering the impact on the
representation of the state space. Following this line of reasoning, we have proposed
a two-step hybrid reachability analysis strategy that combines fast simulation
and guided-traversal. Simulation provides information to identify subsets of the
state space in which the causality between events can be properly identified.
This information can be exploited in a second phase. Causality provides enough
information to efficiently generate large portions of the state space. Additionally,
information about a good chaining order is also extracted, which is used to
guide the later traversal. The combination of both strategies should allow the
reduction of BDD sizes as well as execution times.

Abstract. In this paper we present an explicit disk based verification
algorithm for Probabilistic Systems defining discrete time/finite state
Markov Chains. Given a Markov Chain and an integer k (horizon), our
algorithm checks whether the probability of reaching an error state in at
most k steps is below a given threshold.

We present an implementation of our algorithm within a suitable
extension of the Murφ verifier. We call the resulting probabilistic model
checker FHP-Murφ (Finite Horizon Probabilistic Murφ).

We present experimental results comparing FHP-Murφ with (a finite
horizon subset of) PRISM, a state-of-the-art symbolic model checker
for Markov Chains. Our experimental results show that FHP-Murφ can
handle systems that are out of reach for PRISM, namely those involving
arithmetic operations on the state variables (e.g. hybrid systems).



1 Introduction 

Model checking techniques [5,11,16,15,21,28] are widely used to verify correctness 
of digital hardware, embedded software and protocols by modeling such systems 
as Nondeterministic Finite State Systems (NFSSs). 

However, there are many reactive systems that exhibit uncertainty in their 
behaviour, i.e. which are stochastic systems. Examples of such systems are: fault 
tolerant systems, randomized distributed protocols and communication proto- 
cols. Typically stochastic systems cannot be conveniently modeled using NFSSs. 
However, they can often be modeled by Markov Chains [2,12]. Roughly spea- 
king, a Markov Chain can be seen as an automaton labelled with (outgoing) 
probabilities on its transitions. 

For stochastic systems correctness can only be stated using a probabilistic
approach, e.g. using a Probabilistic Logic (e.g. [32,8,13]). This motivates the
development of Probabilistic Model Checkers [9,1,17], i.e. of model checking
algorithms and tools whose goal is to automatically verify (probabilistic) properties
of stochastic systems (typically Markov Chains). For example, a probabilistic
model checker may automatically verify a system property like “the probability
that a message is not delivered after 0.1 seconds is less than 0.80”.

* This research has been partially supported by MURST projects: MEFISTO and
SAHARA.

Many methods have been proposed for probabilistic model checking, e.g. [10, 
3,8,13,14,19,24,27,32]. 

To the best of our knowledge, currently, the state-of-the-art probabilistic mo- 
del checker is PRISM [25,1,18]. PRISM overcomes the limitations due to the use 
of linear algebra packages in Markov Chain analysis by using Multi Terminal 
Binary Decision Diagrams (MTBDDs) [6], a generalization of Ordered Binary 
Decision Diagrams (OBDDs) [4] allowing real numbers in the interval [0, 1] on 
terminal nodes. More precisely, PRISM can carry out the required Markov Chain 
analysis using a matrix based approach (based on linear algebra packages), a 
symbolic approach (based on the CUDD package [7]) as well as a hybrid ap- 
proach. The user can choose the best approach for the problem at hand. 

Here we are mainly interested in automatic analysis of discrete time/ finite 
state Markov Chains modeling Discrete Time Hybrid Systems. Such Markov 
Chains can in principle be analyzed using PRISM. However, our experience is 
that, using PRISM on our systems, quite soon we run into a state explosion 
problem, i.e. we run out of memory because of the huge OBDDs built during the 
model checking process. This is due to the fact that hybrid systems dynamics 
typically entails many arithmetical operations on the state variables. This makes 
life very hard for OBDDs, thus making usage of a symbolic probabilistic model 
checker (e.g. like PRISM) on such systems rather problematic. 

Indeed our experience shows that Explicit Model Checking can outperform
Symbolic Model Checking in automatic analysis of Hybrid Control Systems [22].
This suggested that we explore the possibility of devising an explicit disk based
algorithm for automatic Finite Horizon safety analysis of Markov Chains. In
this paper we present our algorithm as well as experimental results showing its
effectiveness. Our results can be summarized as follows.

- We present (Sections 3, 4) an explicit algorithm for automatic verification
  of discrete time/finite state Markov Chains. Given a Markov Chain $\mathcal{M}$, our
  algorithm checks whether the probability of reaching a given state s within k
  steps is less than a given bound p. Our algorithm is disk based, thus, because
  of the large size of modern hard disks, state explosion is hardly a problem
  for us. Computation time instead is our bottleneck. Our algorithm can trade
  RAM memory with computation time, i.e. the more RAM available the faster
  our computation. To the best of our knowledge, this is the first time that
  such a disk based algorithm for probabilistic model checking is proposed.

- We present (Section 5) an implementation of our algorithm within the Murφ
  [21] verifier. We call the resulting probabilistic model checker FHP-Murφ
  (Finite Horizon Probabilistic Murφ).

- We present (Section 6.1) experimental results comparing FHP-Murφ with
  PRISM on two suitably modified versions of the dining philosophers protocol
  included in the PRISM distribution. Our experimental results show that
  FHP-Murφ can handle systems that are out of reach for PRISM. However,
  as long as PRISM does not hit state explosion, PRISM is faster than
  FHP-Murφ (as is to be expected).

  Note however that PRISM can handle more general models than FHP-Murφ,
  and can verify more general properties (namely all PCTL [13] properties)
  than FHP-Murφ. In fact, FHP-Murφ can only verify finite horizon safety
  properties for Markov Chains, a subclass (although an important one) of the
  verification tasks that PRISM can handle.

- We present (Section 6.2) experimental results on using FHP-Murφ for a
  probabilistic analysis of a “real world” hybrid system, namely the Turbogas
  Control System of the Co-generative power plant described in [22]. Because
  of the arithmetic operations involved in the definition of system dynamics,
  this hybrid system is out of reach for OBDDs (and thus for PRISM), whereas
  FHP-Murφ can complete (finite horizon) verification within reasonable time.



2 Basic Notation 

Let $S$ be a finite set. We regard functions from $S$ to the real interval $[0, 1]$ and
functions from $S \times S$ to $[0, 1]$ as row vectors and as matrices, respectively. If $x$
is a vector and $s \in S$ we also write $x_s$ or $(x)_s$ for $x(s)$. If $P$ is a matrix and
$s, t \in S$ we also write $P_{s,t}$ or $(P)_{s,t}$ for $P(s, t)$. On vectors and matrices we
use the standard matrix operations. Namely: $xP$ is the row vector $y$ s.t.
$y_s = \sum_{i \in S} x_i P_{i,s}$ and $AB$ is the matrix $C$ s.t. $C_{s,t} = \sum_{i \in S} A_{s,i} B_{i,t}$. We define $A^n$
in the usual way, i.e.: $A^0 = I$, $A^{n+1} = A^n A$, where $I$ (the identity matrix) is
the matrix defined as follows: $I(s, j)$ = if ($s = j$) then 1 else 0. We denote with
$\mathbb{B}$ the set $\{0, 1\}$ of boolean values. As usual 0 stands for false and 1 stands for
true.

We give some basic definitions on Markov Chains. For further details see, 
e.g. [2]. A distribution on $S$ is a function $x : S \to [0,1]$ s.t. 
$\sum_{s \in S} x(s) = 1$. 
Thus a distribution on $S$ can be regarded as a $|S|$-dimensional row vector $x$. A 
distribution $x$ represents state $j \in S$ iff $x(j) = 1$ (thus $x(i) = 0$ when $i \neq j$). 
If distribution $x$ represents $s \in S$, by abuse of language we also write $x \in S$ 
to mean that distribution $x$ represents a state and we use $x$ in place of the 
element of $S$ represented by $x$. In the following we often represent states using 
distributions. This allows us to use matrix notation to define our computations. 

Definition 1. 1. A Discrete Time Markov Chain (just Markov Chain in the 
following) is a triple $\mathcal{M} = (S, P, q)$ where: $S$ is a finite set (of states), 
$q \in S$ and $P : S \times S \to [0, 1]$ is a transition matrix, i.e. for all $s \in S$, 
$\sum_{t \in S} P(s, t) = 1$. (We included the initial state $q$ in the Markov Chain 
definition since in our context this will often shorten our notation.) 

2. An execution sequence (or path) in the Markov Chain $\mathcal{M} = (S, P, q)$ is 
a nonempty (finite or infinite) sequence $\pi = s_0 s_1 s_2 \ldots$ where $s_i$ are states 
and $P(s_i, s_{i+1}) > 0$, $i = 0, 1, \ldots$. If $\pi = s_0 s_1 s_2 \ldots$ we write $\pi(k)$ for $s_k$. The 
length of a finite path $\pi = s_0 s_1 s_2 \ldots s_k$ is $k$ (number of transitions), whereas 
the length of an infinite path is $\omega$. We denote with $|\pi|$ the length of $\pi$. We 
denote with $Path(\mathcal{M}, s)$ the set of infinite paths $\pi$ in $\mathcal{M}$ s.t. $\pi(0) = s$. If 
$\mathcal{M} = (S, P, q)$ we write also $Path(\mathcal{M})$ for $Path(\mathcal{M}, q)$. 

3. For $s \in S$ we denote with $\Sigma(s)$ the smallest $\sigma$-algebra on $Path(\mathcal{M}, s)$ 
which, for any finite path $\rho$ starting at $s$, contains the basic cylinders 
$\{\pi \in Path(\mathcal{M}, s) \mid \rho \text{ is a prefix of } \pi\}$. The probability measure $Pr$ on $\Sigma(s)$ 
is the unique measure with 
$Pr(\{\pi \in Path(\mathcal{M}, s) \mid \rho \text{ is a prefix of } \pi\}) = Pr(\rho) = \prod_{i=0}^{k-1} P(\rho(i), \rho(i+1)) = P(\rho(0), \rho(1))\, P(\rho(1), \rho(2)) \cdots P(\rho(k-1), \rho(k))$, 
where $k = |\rho|$. 

E.g. given distribution $x$, the distribution $y$ obtained by one execution step 
of Markov Chain $\mathcal{M} = (S, P, q)$ is computed as: $y = xP$. In particular if 
$y = xP$ and $x(s) = 1$ we have that $\forall t\, [y(t) = (P)_{s,t}]$. 
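
As a small illustration of the execution step $y = xP$, the following Python sketch multiplies a row vector by a matrix stored as nested lists. The two-state matrix `P` and the function name `step` are hypothetical and chosen only for this example; they do not come from the paper.

```python
def step(x, P):
    """Return the row vector y = xP, i.e. y[s] = sum_i x[i] * P[i][s]."""
    n = len(x)
    return [sum(x[i] * P[i][s] for i in range(n)) for s in range(n)]

# Hypothetical two-state transition matrix (each row sums to 1).
P = [[0.5, 0.5],
     [0.25, 0.75]]

# The distribution x represents state 0, so y equals row 0 of P.
x = [1.0, 0.0]
y = step(x, P)
```

Because `x` represents a single state, `y` coincides with the corresponding row of `P`, as stated above.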

3 Finite Horizon Safety Verification of Markov Chains 

Given a Markov Chain, we want to compute the probability that a path of length 
$k$ starting from a given initial state $q$ reaches a state $s$ satisfying a given boolean 
formula $\phi$ (i.e. $\phi(s) = 1$). If $\phi$ models an error condition the above computation 
allows us to compute the probability of reaching an error condition in at most 
$k$ transitions. 

Problem 1. Let $\mathcal{M} = (S, P, q)$ be a Markov Chain, $k \in \mathbb{N}$, and $\phi$ be a boolean 
function on $S$. We want to compute: 
$P(\mathcal{M}, k, \phi) = Pr(\exists i \le k\ \phi(\pi(i)) \mid \pi \in Path(\mathcal{M}))$. 
That is, we want to compute the probability of reaching a state satisfying 
$\phi$ in at most $k$ steps in Markov Chain $\mathcal{M}$ (starting from the initial state 
$q$ of $\mathcal{M}$). 



Definition 2. Let $\mathcal{M} = (S, P, q)$ be a Markov Chain and let $\phi$ be a boolean 
function on $S$, i.e. $\phi : S \to \mathbb{B}$. We define Markov Chain $\mathcal{M}_\phi$ as follows. 
$\mathcal{M}_\phi = (S, P_\phi, q)$, where for all $s, t \in S$, 

$$P_\phi(s, t) = \begin{cases} P(s, t) & \text{if } \neg\phi(s) \\ 1 & \text{if } \phi(s) \wedge (s = t) \\ 0 & \text{if } \phi(s) \wedge (s \neq t) \end{cases}$$

In other words, Markov Chain $(S, P_\phi, q)$ is obtained from $(S, P, q)$ by removing 
all outgoing edges from any state $s$ satisfying $\phi$ (error state) and replacing 
such outgoing edges with just one edge leading back to $s$. Thus, once an error 
state is entered there is no way to leave it. This, in turn, means that for $(S, P_\phi, q)$ 
the probability of reaching in exactly $k$ steps a state satisfying $\phi$ is exactly the 
same as the probability of reaching in at most $k$ steps a state satisfying $\phi$. Note 
that according to item 1 of Definition 1 $(S, P_\phi, q)$ is indeed a Markov Chain. 
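
The construction in Definition 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: states are indexed `0..n-1`, the matrix is a nested list, and the names `make_absorbing` and the sample matrix are hypothetical.

```python
def make_absorbing(P, phi):
    """Build P_phi: rows of states satisfying phi become self-loops."""
    n = len(P)
    P_phi = []
    for s in range(n):
        if phi(s):
            # Error state: a single outgoing edge leading back to s.
            P_phi.append([1.0 if t == s else 0.0 for t in range(n)])
        else:
            # Non-error state: keep the original row of P.
            P_phi.append(list(P[s]))
    return P_phi

# Hypothetical two-state chain in which only state 1 satisfies phi.
P = [[0.8, 0.2],
     [0.5, 0.5]]
P_phi = make_absorbing(P, lambda s: s == 1)
```

Every row of `P_phi` still sums to 1, which is the condition of item 1 of Definition 1.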

From the above considerations it follows that $P(\mathcal{M}, k, \phi)$ can be computed from 
$P_\phi$ as shown in Proposition 1. Essentially Proposition 1 is a specialization to 
our finite horizon case of known results on PCTL Model Checking of Markov 
Chains (e.g. [13,1]). 

Proposition 1. Let $\mathcal{M} = (S, P, q)$, and let $\phi$ be a boolean function on $S$. Then 
$P(\mathcal{M}, k, \phi) = Pr(\exists i \le k\ \phi(\pi(i)) \mid \pi \in Path(\mathcal{M})) = \sum_{s \in S,\ \phi(s)} (q P_\phi^k)(s)$. 

Example 1. Consider Markov Chain $\mathcal{M} = (S, P, q)$ with $S = \{1, 2\}$, 
$P = \begin{bmatrix} 0.8 & 0.2 \\ \ldots & \ldots \end{bmatrix}$ 
and $q = [1\ 0]$ (i.e. distribution $q$ denotes state 1). The usual automata-like 
representation for $\mathcal{M}$ is given in Fig. 1. 

Fig. 1. A Markov Chain 

Let $\phi$ be defined as follows: $\phi(s) = (s = 2)$, i.e. only state 2 satisfies $\phi$. 
Then 
$P_\phi = \begin{bmatrix} 0.8 & 0.2 \\ 0.0 & 1.0 \end{bmatrix}$. 
From Proposition 1 we have: $P(\mathcal{M}, 1, \phi) = 0.2$; $P(\mathcal{M}, 2, \phi) = 0.36$; 
$P(\mathcal{M}, 3, \phi) = 0.488$. 
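
The values in Example 1 can be checked numerically. The sketch below assumes the matrix-power reading of Proposition 1 (sum of the entries of $q P_\phi^k$ over the states satisfying $\phi$); the function name and the 0-based state indexing (index 0 for state 1, index 1 for state 2) are choices made for this illustration only.

```python
def finite_horizon_probability(q, P_phi, phi, k):
    """Sum of (q * P_phi^k)(s) over all states s with phi(s) true."""
    y = list(q)
    n = len(y)
    for _ in range(k):
        # One execution step: y <- y * P_phi.
        y = [sum(y[i] * P_phi[i][s] for i in range(n)) for s in range(n)]
    return sum(y[s] for s in range(n) if phi(s))

# P_phi and q from Example 1; index 1 stands for state 2.
P_phi = [[0.8, 0.2],
         [0.0, 1.0]]
q = [1.0, 0.0]
phi = lambda s: s == 1

results = [finite_horizon_probability(q, P_phi, phi, k) for k in (1, 2, 3)]
```

Up to floating point rounding, `results` agrees with the three probabilities 0.2, 0.36 and 0.488 given in the example.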



4 Probabilistic Finite State Systems 

The Markov Chain Definition in Definition 1 is appropriate to study mathema- 
tical properties of Markov Chains. However Markov Chains arising from pro- 
babilistic concurrent systems are usually defined using a suitable programming 
language rather than a stochastic matrix. As a matter of fact the (huge) size of 
the stochastic matrix of concurrent systems is one of the main obstructions to 
overcome in probabilistic model checking. 

Thus a Markov Chain is presented to a model checker by defining (using 
a suitable programming language) a next state function that returns the nee- 
ded information about the immediate successors of a given state. The following 
definition formalizes this notion. 

Definition 3. A Probabilistic Finite State System (PFSS) $\mathcal{S}$ is a 3-tuple 
$(S, q, next)$, where: $S$ is a finite set (of states), $q \in S$ and $next$ is a function 
taking a state $s$ as argument and returning a set $next(s)$ of pairs $(t, p)$ s.t. 
$\sum_{(t,p) \in next(s)} p = 1$. 

To a PFSS we can associate a Markov Chain in a unique way. 

Definition 4. 1. Let $\mathcal{S} = (S, q, next)$ be a PFSS. The Markov Chain 
$\mathcal{S}^{mc} = (S, P, q)$ associated to $\mathcal{S}$ is defined as follows: 

$$P(s, t) = \begin{cases} p & \text{if } (t, p) \in next(s) \\ 0 & \text{otherwise} \end{cases}$$

2. Given $k \in \mathbb{N}$ and a boolean function $\phi$ on $S$ we write $P(\mathcal{S}, k, \phi)$ for 
$P(\mathcal{S}^{mc}, k, \phi)$ as defined in Problem 1. Thus Problem 1 for PFSSs becomes: 
given a PFSS $\mathcal{S}$ compute $P(\mathcal{S}, k, \phi)$. 

Given a PFSS $\mathcal{S}$ we want to compute $P(\mathcal{S}, k, \phi)$ without generating the 
transition matrix for Markov Chain $\mathcal{S}^{mc}$. Using Proposition 1 this can be done as 
shown in Proposition 2. 

Proposition 2. Let $\mathcal{S} = (S, q, next)$ be a PFSS, $k \in \mathbb{N}$ and $\phi$ be a boolean 
function on $S$. Then $P(\mathcal{S}, k, \phi)$ can be computed as shown in Fig. 2. 
// For i = 1, ..., k + 1, Q(i) is a queue of state-probability pairs (s, p); 

P((S, q, next), k, φ) { i = 1; r = 0; enqueue(Q(i), (q, 1)); 

  forall i = 1 ... k do { while (Q(i) is nonempty) { (s, p) = dequeue(Q(i)); 

    forall (t, a) ∈ next(s) do { if (φ(t)) {r = r + p*a;} 

      else enqueue(Q(i + 1), (t, p*a)); } } } return(r); } 



Fig. 2. Computation of $P(\mathcal{S}, k, \phi)$. 

Proof. (Sketch). Let $\mathcal{M} = \mathcal{S}^{mc} = (S, P, q)$. Consider the following sequence of 
distributions: $y^{(0)} = q$, $y^{(i+1)} = y^{(i)} P_\phi$ for $i = 0, \ldots, k$. From Proposition 1 we 
have that $\sum_{s \in S,\ \phi(s)} y^{(k)}(s) = P(\mathcal{S}^{mc}, k, \phi)$. Moreover, from Fig. 2 we have that 
for all $i = 1, \ldots, k + 1$, and for all $s \in S$, $y^{(i)}(s) \neq 0$ iff $(s, y^{(i)}(s)) \in Q(i)$. 
Note that states $s$ s.t. $\phi(s) = 1$ are not enqueued. In fact, by Definition 2, the 
only state reachable from such a state $s$ is $s$ itself. Thus, from Definition 2 of 
$P_\phi$, we have that the value $r$ returned by $P((S, q, next), k, \phi)$ in Fig. 2 is exactly 
$P(\mathcal{S}^{mc}, k, \phi)$. 
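
The level-by-level computation of Fig. 2 can be written as a short runnable Python sketch. It keeps only two in-memory queues (the current level and the next one) instead of the family $Q(1), \ldots, Q(k+1)$; the names `fhp_probability` and `next_fn`, and the sample PFSS mirroring Example 1, are assumptions made for this illustration.

```python
from collections import deque

def fhp_probability(q, next_fn, phi, k):
    """Probability of reaching a state satisfying phi within k steps."""
    r = 0.0
    current = deque([(q, 1.0)])        # plays the role of Q(i)
    for _ in range(k):
        following = deque()            # plays the role of Q(i + 1)
        while current:
            s, p = current.popleft()
            for t, a in next_fn(s):
                if phi(t):
                    r += p * a         # error state reached: accumulate
                else:
                    following.append((t, p * a))
        current = following
    return r

# Hypothetical PFSS mirroring Example 1: state 1 moves to state 2 w.p. 0.2.
def next_fn(s):
    return [(1, 0.8), (2, 0.2)] if s == 1 else [(2, 1.0)]

prob = fhp_probability(1, next_fn, lambda s: s == 2, 3)
```

As in Fig. 2, states satisfying the error condition are never enqueued, and no transition matrix is ever built; only `next_fn` is queried.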



Remark 1. Given a PFSS $\mathcal{S} = (S, q, next)$, $k \in \mathbb{N}$, a boolean function $\phi$ on $S$ and 
a probability threshold $p$, in Section 5, exploiting Proposition 2, we will present 
an efficient disk based algorithm to check if it holds that $P(\mathcal{S}, k, \phi) < p$. In other 
words, our algorithm checks validity of a Finite Horizon Probabilistic (FHP) 
Safety Property. FHP safety properties are a very important class of properties. 
This motivates our disk based algorithm. 

Of course a FHP safety property can be easily defined with a PCTL [13] formula, 
namely $P_{<p}[\mathit{true}\ U^{\le k}\ \phi]$. Thus also the probabilistic model checker PRISM 
[25] can be used to verify FHP safety properties. 

Note however that PRISM can handle all PCTL formulas, whereas our algorithm 
can only handle FHP safety properties. In particular PRISM can verify 
properties like $P_{<p}[\mathit{true}\ U\ \phi]$ (the probability of reaching a state satisfying $\phi$ is 
less than $p$). Such unbounded horizon properties cannot be handled with our 
algorithm. 



5 Analysing Probabilistic Systems with the Murφ Verifier 

Building on the computation scheme in Fig. 2, in the following we describe 
an efficient disk based algorithm to verify FHP safety properties, as well as 
an implementation of such an algorithm within the Murφ verifier. We call the 
resulting tool FHP-Murφ (Finite Horizon Probabilistic Murφ). 

5.1 Functions and Data Structures 

FHP-Murφ input defines a PFSS $\mathcal{S} = (S, q, next)$ to which we will refer in the 
sequel. The FHP-Murφ keyword startstate defines the initial state $q$ of $\mathcal{S}$. Indeed, 
Murφ can have a set of initial states; however, w.l.o.g. in the following we assume 
we have just one initial state. The FHP-Murφ keyword invariant defines the boolean 
function $\phi$ on $S$ as well as the probability threshold $\beta$ s.t. $P(\mathcal{S}, k, \phi) < \beta$ must 
hold (Remark 1). 

int k; /* is the horizon, i.e. the max allowed number of steps 
          to reach a state violating our invariant */ 
boolean Phi(state s); state_probability_pairs next(state s); 

Queue Q_old, Q_new; Cache M; 

double prob_Phi; /* incrementally stores the probability of 
                    violating the invariant in at most k steps */ 
double max_prob_Phi; /* is the max allowed value for prob_Phi */ 



Fig. 3. Functions and Data Structures 

The meaning of the declarations in Fig. 3 is as follows. Constant k (implementing 
$k$) is our verification horizon and is given to FHP-Murφ as a command 
line parameter. Function Phi() implements $\phi$. Function next() is the next state 
function of the PFSS $\mathcal{S}$ defined by FHP-Murφ input. Thus function next() takes 
a state s as argument and returns the set next(s) of pairs $(t, p)$ s.t. s goes 
to $t$ with probability $p$. Queues Q_old and Q_new are used to store distributions. 
Thus queue elements are pairs $(s, p)$ where $s$ is a state and $p$ is the probability 
of reaching $s$ from the initial state of $\mathcal{S}$. Such queues play, respectively, the same 
role as queues $Q(i)$ and $Q(i + 1)$ in the while loop in Fig. 2. Queues Q_old and 
Q_new are the only place in which state explosion may occur in our algorithm. 
For this reason we implement them on disk analogously to [31]. This allows us 
to handle fairly large state spaces. The hash table M is a cache whose entries are 
pairs $(s, p)$ as for queues Q_old, Q_new. Constant max_prob_Phi (implementing $\beta$) 
defines our probability threshold, i.e. the max allowed value for the probability 
prob_Phi of reaching (within the given horizon k) an error state (i.e. a state s 
s.t. Phi(s) = true). 

Note that from the above discussion it follows that Murφ hash compaction (-c) 
[21] has no effect in FHP-Murφ since no FHP-Murφ data structure uses state 
signatures [29,30]. 



5.2 Functions Search() and Insert() 

Our main function Search() is shown in Fig. 4. This function efficiently implements 
the computation described in Fig. 2. 

Function Insert() is shown in Fig. 4. This function uses a cache table M 
in RAM to save queue space and thus computation time. M[h] returns the pair 
(s, p) stored in entry h of M. M[h].state denotes s and M[h].prob denotes p. 

Every time it is necessary to enqueue a new pair (state s, probability p), 
Insert(s, p) is called. If state s is already stored in cache M, we simply update 
the stored probability in M, adding p to it. If state s is not stored in M, we check 
if the slot in M in which we have to put s is free. If it is free then we insert pair 
(s, p) in M. If it is not free, we call function Checktable() to empty M and then 
we insert pair (s, p) in M. 
int Search() { 
  prob_Phi = 0; 
  enqueue(Q_old, (q, 1)); /* enqueue initial state q */ 
  for (level = 1; level <= k; level++) { 
    clear cache table M; 
    while (Q_old is not empty) { 
      (s, p) = dequeue(Q_old); 
      for all (s', a) in next(s) { 
        if (Phi(s')) { 
          prob_Phi = prob_Phi + p*a; 
          if (prob_Phi >= max_prob_Phi) 
            return(0); /* property does not hold */ 
        } else Insert(s', p*a); 
      } /* for all */ 
    } /* while, level terminated, Q_old is empty */ 
    Checktable(); 
    swap Q_new with Q_old; /* now, Q_new is empty */ 
  } /* for */ 
  return(1); /* property holds */ 
} /* Search() */ 

Insert(state s, double p) { 
  if (s is in M) { 
    h = hash(s); 
    prob = M[h].prob + p; 
    M[h] = (s, prob); /* new probability of s is prob */ 
  } else { 
    /* Insert_in_table() returns true iff (s, p) has been stored */ 
    inserted = Insert_in_table(s, p); 
    if (!inserted) { /* collision: slot taken by another state */ 
      Checktable(); /* there is space to insert now */ 
      Insert_in_table(s, p); 
    } 
  } 
} /* Insert() */ 

boolean Insert_in_table(state s, double p) { 
  h = hash(s); 
  if (M[h] is free) { 
    M[h] = (s, p); 
    return true; 
  } 
  else return (M[h].state == s); 
} /* Insert_in_table() */ 

Checktable() { 
  move M in Q_new and clear M; /* M is empty now */ 
} /* Checktable() */ 



Fig. 4. Functions: Search(), Insert(), Insert_in_table(), Checktable() 
If we were not using M, for each state s at level i we would have w copies 
of s in the queue, where w is the number of paths of length i leading to state 
s from initial state q. Using M, rather than w copies of s we have just one or 
slightly more than one (depending on how large M is). This saves queue space 
as well as computation time. Hence, the more RAM available for M, the fewer 
duplicated states, the smaller the queues, the fewer states to be explored and, 
finally, the shorter our computation time. For this reason M should be as large as possible. 



5.3 Functions Insert_in_table() and Checktable() 

Function Insert_in_table() is shown in Fig. 4. Function Insert_in_table() 
calculates the hash value h of s. If M[h] is a free slot, Insert_in_table() inserts 
s and p in M[h] and returns true. If M[h] is not free, Insert_in_table() returns 
false without inserting s and p in M. 

Function Checktable() is shown in Fig. 4. It is the only function that enqueues 
values in Q_new; it simply flushes M into Q_new. 

Function Checktable() is used by function Insert() to free M when a collision 
occurs. It is also called at the end of the while in function Search() (Fig. 
4) to enqueue in Q_new the states visited after the last call to function Insert(), 
so that all states reached in the current level will be expanded in the next one. 
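
The interplay of Insert(), Insert_in_table() and Checktable() can be illustrated with the following Python sketch. It is not the FHP-Murφ implementation: the class name `LevelCache`, the use of Python's built-in `hash`, and the in-memory list standing in for the disk queue Q_new are all assumptions made for the example.

```python
class LevelCache:
    """Direct-mapped cache of (state, probability) pairs for one BFS level."""

    def __init__(self, size):
        self.slots = [None] * size   # plays the role of the cache table M
        self.q_new = []              # stands in for the disk queue Q_new

    def _index(self, s):
        return hash(s) % len(self.slots)

    def insert(self, s, p):
        h = self._index(s)
        entry = self.slots[h]
        if entry is not None and entry[0] == s:
            # State already cached: merge the probabilities.
            self.slots[h] = (s, entry[1] + p)
            return
        if entry is not None:
            # Collision with a different state: flush the cache first.
            self.checktable()
        self.slots[h] = (s, p)

    def checktable(self):
        """Flush every cached pair into q_new and clear the cache."""
        self.q_new.extend(e for e in self.slots if e is not None)
        self.slots = [None] * len(self.slots)

cache = LevelCache(4)
cache.insert(1, 0.25)
cache.insert(1, 0.25)   # same state: merged into a single pair
cache.insert(5, 0.5)    # 5 maps to the slot of 1 here: forces a flush
cache.checktable()      # end of level: flush what is left
```

In this run state 1 is enqueued once with its two probabilities merged, which is the saving in queue space that Section 5.2 attributes to the cache.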

6 Experimental Results 

To show the effectiveness of our approach we run two kinds of experiments. 

First, in Section 6.1, we compare FHP-Murφ with the probabilistic model 
checker PRISM [25]. 

Second, in Section 6.2, we run FHP-Murφ on a quite large probabilistic 
hybrid system. Since our main goal is to use FHP-Murφ on hybrid systems, 
this second kind of evaluation is very interesting for us. 



6.1 Probabilistic Dining Philosophers 

In this Section we give our experimental results on using FHP-Murφ on the 
probabilistic protocols included in the PRISM distribution [25]. We do not consider 
the protocols that lead to Markov Decision Processes or to Continuous Time 
Markov Chains, since FHP-Murφ cannot deal with them. Hence we only consider 
the Pnueli-Zuck [23] and Lehmann-Rabin [20,26] probabilistic dining philosophers 
protocols. Moreover, we modify the PRISM definitions for such protocols in order to 
have a finite horizon property to verify with FHP-Murφ. In fact, FHP-Murφ is 
unable to verify the PCTL properties for these protocols included in the PRISM 
distribution, since they are not of the required (finite horizon probabilistic safety) 
form $P_{<p}[\mathit{true}\ U^{\le k}\ \phi]$. 

Finally, FHP-Murφ definitions for such protocols have been obtained by 
translating into FHP-Murφ their PRISM (modified) definitions so that for each 
protocol, the FHP-Murφ and PRISM definitions specify exactly the same Markov 
Chain. 

Our modifications to the PRISM protocols consist in adding variables to count 
the number of times that a philosopher fails in getting both forks. We then verify 
that these counters are always less than a given maximum threshold (MAX_CONT 
in the following) with a given probability. This corresponds to verifying quality of 
service properties, which are very frequent in practice. E.g., in the Pnueli-Zuck 
protocol, we replaced the code fragment in Fig. 5 with the one in Fig. 6. 

We want to know the probability P(MAX_CONT, k) of a counter reaching 
MAX_CONT in at most k (horizon) steps. We set k = 20 as our finite horizon (this 
value occurs in a property of the Lehmann-Rabin protocol in the PRISM distribution 
[25]). 

Fig. 7 shows the PCTL property to be verified stating that the probability 
that a counter reaches MAX_CONT has to be at most p. We set p = 1 since for 
computing P(MAX_CONT, k) the value of p does not matter. 

In Fig. 8 we have the FHP-Murφ code corresponding to the PRISM code 
fragment of Fig. 6. Of course the FHP-Murφ input language is the same as the Murφ 
one [21], only FHP-Murφ has probabilities rather than booleans on rule guards. 
The FHP-Murφ invariant "invariant p γ" requires that with probability at least p 
“all states reachable in at most k steps from the initial state satisfy γ” (k is the 
FHP-Murφ horizon). Thus, using the notation in Section 5 we have that: $\phi = \neg\gamma$ 
and the probability threshold (max_prob_Phi in Fig. 3) is $(1 - p)$. 

Note that in Fig. 8 the probability threshold for the FHP-Murφ invariant is 0, 
so that FHP-Murφ will not stop verification before completing all levels of the 
BF computation. This forces FHP-Murφ to compute P(MAX_CONT, k). 

To assess FHP-Murφ effectiveness, in Figs. 9, 10 we compare the results 
obtained with FHP-Murφ and with PRISM on, respectively, the Pnueli-Zuck and 
Lehmann-Rabin protocols (modified as described above). 

From Fig. 9 we can see that, for the Pnueli-Zuck algorithm, when NPHIL = 5 
(5 philosophers) and MAX_CONT is 4, PRISM is unable to complete any verification 
within 2GB of RAM, regardless of which of the 3 PRISM verification 
algorithms (totally MTBDD based, algebraic and hybrid) is chosen. Similarly, 
for the Lehmann-Rabin algorithm, in Fig. 10 we see that when NPHIL is 4, and 
MAX_WAIT is 3, then PRISM is unable to complete the verification task in the 
same environment as above. 

FHP-Murφ was always able to complete all given verification tasks. Note 
however that, as can be seen from Figs. 9 and 10, for the verification tasks 
in which PRISM terminates, PRISM is always faster than FHP-Murφ. 

Our experimental results show that for probabilistic protocols involving arithmetical 
computations FHP-Murφ is to be considered among the available (and 
valuable) tools for automatic finite horizon analysis of safety properties. 

As for the numerical quality of FHP-Murφ we have that when both PRISM 
and FHP-Murφ terminate both give the same value for P(MAX_CONT, k) (column 
Probability in Figs. 9, 10). 
module phil1 

p1 : [0..10] init 0; 

[] p1=6 -> (p1'=1); 

[] p1=7 -> (p1'=1); 

[] p1=10 -> (p1'=0); 
endmodule 

Fig. 5. Pnueli-Zuck algorithm fragment to be modified in PRISM. 



module phil1 

p1 : [0..10] init 0; 

cont1 : [0..3] init 0; 

[] p1=6 & cont1!=MAX_CONT -> (p1'=1) & (cont1'=cont1+1); 

[] p1=6 & cont1=MAX_CONT -> (p1'=1); 

[] p1=7 & cont1!=MAX_CONT -> (p1'=1) & (cont1'=cont1+1); 

[] p1=7 & cont1=MAX_CONT -> (p1'=1); 

[] p1=10 -> (p1'=0) & (cont1'=0); 
endmodule 

Fig. 6. Pnueli-Zuck algorithm modified fragment in PRISM. 



P>=1.0 [true U<=20 ((cont1 = MAX_CONT) | (cont2 = MAX_CONT) | 
(cont3 = MAX_CONT))] 

Fig. 7. PCTL formula in PRISM. 



function calc_prob(i : 1..NPHIL; c : 0..10) : prob; 
-- probability that p[i] becomes c, NPHIL is the number of philosophers 
begin 
  switch p[i] -- p[1] corresponds to PRISM p1, p[2] to PRISM p2 etc 
  case 6: if (c = 1) then return 1.0 / NPHIL; else return 0.0; endif 
  case 7: if (c = 1) then return 1.0 / NPHIL; else return 0.0; endif 
  endswitch; end; 

ruleset philosophers : 1..NPHIL do ruleset next : 0..10 do rule "next" 
calc_prob(philosophers, next) ==> begin 
  p[i] := c; 
  -- cont[1] corresponds to PRISM cont1, cont[2] to PRISM cont2 etc 
  if (c = 1 & (p[i] = 6 | p[i] = 7) & (cont[i] != MAX_CONT)) 
  then cont[i] := cont[i] + 1; endif; 
  if (p[i] = 10 & c = 0) then cont[i] := 0; endif; end; end; end; 

invariant "starvation" 0.0 
  forall i : 1..NPHIL do (cont[i] != MAX_CONT) endforall; 



Fig. 8. Pnueli-Zuck algorithm in FHP-Murφ. 
Fig. 9. Results on a machine with 2 processors (both Intel Pentium III 500MHz) 
and 2GB of RAM. Murφ options: -b (bit compression), -m200 (use exactly 200MB of 
RAM), -maxl20 (the finite horizon is 20). The last verification had -m1000 (use exactly 
1GB of RAM). PRISM options: default options. N/A means that PRISM was unable 
to complete the verification; in this case, also the -m and -s options (totally MTBDD and 
algebraic verification algorithm respectively) have been used, with the same result. 
Memory occupations are in MB, time is in seconds. 

Fig. 10. W.r.t. Fig. 9, the only change is in the Murφ option -m800 (use exactly 
800MB of RAM). 



6.2 Analysis of a Probabilistic Hybrid System with FHP-Murφ 

In this section we show our experimental results on using FHP-Murφ for the 
analysis of a real world hybrid system, namely the Control System for the Gas 
Turbine of a 2MW Electric Co-generative Power Plant (ICARO) in operation at 
the ENEA Research Center of Casaccia (Italy). 

Our control system (Turbogas Control System, TCS, in the following) is the 
heart of ICARO and is indeed the most critical subsystem in ICARO. Unfortunately 
TCS is also the largest ICARO subsystem, thus making the use of model 
checking for such a hybrid system a challenge. 

In [22] it is shown that by adding finite precision real numbers to Murφ, 
we can use Murφ to automatically verify TCS. In particular, [22] shows the 
following. If the speed of variation of the user demand for electric 
power (MAX_D_U in the following) is greater than or equal to 25 (kW/sec), TCS 
fails in maintaining ICARO parameters within the required safety ranges. 

A TCS state in which one of the ICARO parameters is outside its given safety 
range is of course considered an error state. 

In [22] the user demand has been modeled rather roughly, using nondeterministic 
automata. Here we show that using FHP-Murφ we can define and, more 
importantly, automatically analyse, a more accurate model for the user demand 
by modeling it using a Markov Chain. 

To do this we define a function p{u, i) as follows: 

$$p(u, i) = \begin{cases} 0.4 + \beta\,[\ldots] & \text{if } i = 1 \\ 0.2 & \text{if } i = 0 \\ [\ldots] & \text{if } i = -1 \end{cases} \qquad (1)$$
ruleset d_u : -1..1 do /* disturbance: takes values -1, 0 and 1 */ 
rule "time step" user_demand(u, d_u) ==> main(u, d_u); 
end; -- user demand disturbance 



Fig. 11. Rulesets with probabilistic user demand 



Fig. 12. Results on a machine with 2 processors (both Intel Pentium III 500MHz) 
and 2GB of RAM. Murφ options used: -b (bit compression), -m500 (use 500 MB of 
RAM). Time is given in seconds. 



where $M$ = MAX_U (maximum user demand value) and $a$ = MAX_D_U. 

Denoting with $u(t)$ the user demand value at time $t$ we can define the (stochastic) 
dynamics for the user demand as follows: 

$$u(t + 1) = \begin{cases} \min(u(t) + a, M) & \text{with probability } p(u(t), 1) \\ u(t) & \text{with probability } p(u(t), 0) \\ \max(u(t) - a, 0) & \text{with probability } p(u(t), -1) \end{cases} \qquad (2)$$

In this way, we have that the further $u(t)$ is from $u_0$, the higher the probability 
to return towards $u_0$, i.e. to decrement $u(t)$ if $u(t) > u_0$ and to increment it 
otherwise. 
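
The dynamics (2) have the shape of a PFSS next state function: from a demand value they yield three successor values with their probabilities. The Python sketch below illustrates only that shape. The function `p` is passed in by the caller, and the constant distribution `constant_p` used in the example is a hypothetical stand-in, not the function (1) of the paper; the names `demand_successors` and `constant_p` are likewise assumptions.

```python
def demand_successors(u, a, M, p):
    """Successor demand values of u with their probabilities, as in (2)."""
    return [
        (min(u + a, M), p(u, 1)),    # demand increases, capped at M
        (u, p(u, 0)),                # demand unchanged
        (max(u - a, 0), p(u, -1)),   # demand decreases, floored at 0
    ]

def constant_p(u, i):
    """Hypothetical stand-in for p(u, i): ignores u entirely."""
    return {1: 0.4, 0: 0.2, -1: 0.4}[i]

succ = demand_successors(10, 5, 12, constant_p)
```

Here the increase from 10 by 5 is capped at the maximum 12, and the three probabilities sum to 1, as required for the outgoing transitions of a Markov Chain state.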

To see that (2) is indeed a Markov Chain, it is sufficient to observe 
that, for all $\beta$, the sum of the probabilities of the outgoing transitions is obviously 1. Moreover, since 
$[\ldots] < 1$, as long as $-0.4 < \beta < 0.4$ holds, all probability values are 
between 0 and 1. 

With FHP-Murφ the definition of Markov Chain (2), starting from the TCS 
model, is quite simple. This is done in Fig. 11, where user_demand(u, d_u) 
computes $p(u, d\_u)$ (1) and function main updates the system state, in particular 
updates u as described in (2). 

In Fig. 12 we report the results of some verification runs done by FHP-Murφ 
with $\beta = 0.4$. 

We are interested in cases where the error probability is greater than 0 (zero) . 
From the results in [22] we know that this is the case if we choose MAX_D_U greater 
than or equal to 25 and the horizon value no smaller than the transition graph 
diameter. In our experiments here we choose our horizon as follows. Let Diam(n) 
be the diameter of TCS transition graph when MAX_D_U = n. We set our horizon 
k to be equal to [ 100- In this way we check the error probability in the 

error neighborhood. 

Fig. 12 allows us to evaluate the probability of reaching an error state when 
MAX_D_U is greater than or equal to 25. Note that such a probability is rather 
small, suggesting that in many cases setting MAX_D_U to 25 may be acceptable. 
This kind of evaluation is not possible with the nondeterministic verification 
of TCS carried out in [22]. 

7 Conclusions 

We presented (Sections 3, 4) an explicit disk based verification algorithm for 
Probabilistic Systems defining discrete time/finite state Markov Chains. Given 
a Markov Chain and an integer k (horizon) our algorithm checks that the probability 
of reaching a given error state in at most k steps is below a given probability 
threshold. 

We presented (Section 5) an implementation of our algorithm within a suitable 
extension of the Murφ verifier that we call FHP-Murφ (Finite Horizon 
Probabilistic Murφ). 

We presented (Section 6) experimental results comparing FHP-Murφ with 
(a finite horizon subset of) PRISM, a state-of-the-art symbolic model checker 
for Markov Chains. Our experimental results show that FHP-Murφ can handle 
systems that are out of reach for PRISM, namely those involving arithmetic 
operations on the state variables (e.g. hybrid systems). 

Future work includes extending our approach to other models (e.g. Continuous 
Time Markov Chains) as well as to other kinds of PCTL formulas, e.g. 
formulas with unbounded until. 
Abstract. This paper presents an efficient method to avoid memory 
explosion in symbolic model checking through the use of partitioning 
techniques. Dynamic repartitioning of Partitioned OBDDs (POBDDs) is 
investigated to enhance the efficiency of symbolic verification techniques. 
New and improved algorithms are presented for reachability based in- 
variant checking and for model checking a fraction of CTL that is found 
to be most important in practice. These algorithms hinge on dynami- 
cally repartitioning the state space and exploit the partitioned nature 
of the data structure. The effectiveness of the partitioning approach is 
demonstrated on both proprietary industrial designs as well as public 
benchmark circuits. Notably, the approach is able to verify, and in some 
cases falsify, properties of interest in industry on large designs which were 
otherwise intractable for verification by other state-of-the-art tools. 



1 Introduction 

Computation Tree Logic (CTL) [6] has proved to be a popular specification 
language for expressing properties for formal verification of designs, especially 
hardware. Model checking [6,7] is the prominent automatic formal verification 
methodology. Reduced Ordered Binary Decision Diagrams (ROBDDs) [4] cur- 
rently serve as the data structure of choice during symbolic model checking [13], 
because they have the desirable property of being canonical as well as manip- 
ulable. ROBDDs have efficient representations for many functions of practical 
interest. Unfortunately, some applications require representation of functions 
that only have exponential ROBDD size. This limits the complexity of problems 
that can be attacked by ROBDDs. 

A more efficient representation was proposed through the use of Partitioned- 
ROBDDs (POBDDs) [12] especially for large designs. In this approach, different 
partitions of the Boolean space are allowed to have different variable orderings 
and only one partition needs to be in memory at any given time. In this paper, 
we extend and improve this approach to address the following issues. 
Firstly, we propose the use of dynamically Partitioned-OBDDs. This parti- 
tioning technique dynamically varies the number of partitions that are created 
and is thereby able to avoid memory explosion. Theoretical evidence [2] suggests 
that representations using this approach can be exponentially more compact 
than an approach using a fixed constant number of partitions. We incorporate 
this dynamic repartitioning in reachability based invariant checking as well as 
model checking for a portion of CTL. 

Secondly, we also propose a new algorithm for model checking a significant 
portion of CTL. This portion is defined as those formulae, which can be repre- 
sented without the use of the greatest fixpoint in existential normal form. More 
precisely, we efficiently handle the temporal modalities EX, EF and their du- 
als as well as EU. Such formulae are found to be a significant fraction of the 
properties that are of practical interest to hardware designers. In particular, this 
includes invariants as well as FSM deadlock avoidance properties. 

It has been previously shown [16,15] that POBDDs can be used analogously 
to OBDDs for most applications. However, a straightforward implementation 
using the conventional algorithm leads to excessive overhead in the form of disk 
accesses, BDD variable reorderings, etc. The proposed algorithm leverages the 
partitioned nature of the data structure in order to significantly reduce these 
overheads. This is, to our knowledge, the first algorithm to take full advantage of 
the ideas of partitioning at an algorithmic level in the model checking procedure. 

Thirdly, though it may not be obvious, the use of a partitioning based 
representation is not practical at all if one cannot devise a practical and competitive 
strategy to discover, when appropriate, a path leading to an erroneous state. We 
provide a novel method to determine such a path. In many cases, this method may 
be able to provide an error trace more efficiently than using classical OBDD 
based methods. 

To our knowledge, this is one of the few papers demonstrating the use of 
partitioning based data structures in an industrial setting. On many public 
benchmark circuits it also shows non-linear gains in space and time, often an order 
of magnitude or more, over the best known state-of-the-art tool (VIS). Thus, 
we demonstrate that BDD-based verification can be extended beyond the limits of 
classical ROBDD approaches. 

1.1 Comparison with Related Work 

The use of partitioned transition relations [5] was proposed to control the size of 
symbolic representation of transition relations. The set of latches is divided into 
different groups which control the ROBDD-size of the transition relation and 
allow early quantification as well. In POBDDs, the entire Boolean space is
partitioned. Thus, in order to distinguish the sense in which partitioning is
performed, it would be more appropriate to call the former clustered transition
relations. Indeed, the two approaches are orthogonal, and these “clustered”
transition relations are used in the image computation of our approach as well.

Recently, a method for distributed model checking was studied by [10,9]. It 
parallelizes the classical symbolic model checking algorithm using the
partitioning approach suggested in [15]. This approach uses slicing, which is similar to
partitioning, with the objective of doing model checking in a distributed fash- 
ion. This approach does not address issues related to costs of communication 
and variable ordering in different partitions. In particular, this approach par- 
titions the computation into a fixed number of fragments equal to the number 
of processors available in the distributed environment. However, as noted in the
literature [2], a partitioning scheme with $k$ partitions can be exponentially more
succinct than one with just $k-1$ partitions. Thus, the a priori selection of the
number of fragments greatly limits the efficiency of the partitioned data
structure. Indeed, the gain from such a static method would be obtained substantially
from parallelization rather than from the inherent algorithmic advantages offered 
by the POBDD data structure. 

In contrast, our algorithms effectively capitalize on the partitioned nature of 
the data structure. We require only one partition to be in memory for any image 
computation, and each partition can be independently ordered. Significantly, 
this approach incorporates a dynamic repartitioning scheme which allows for an 
unbounded number of partitions to be automatically created when necessary. At 
the same time, we show how to drastically cut down the number of instances 
of inter-partition communications as compared to the classical approach. This 
reduces the number of transfers and reorderings of large BDDs between partitions
and is found to be a significant gain in practice. We also address the issue of 
efficient determination of error trace in the presence of partitioning. 

In the rest of this paper, we first give an overview of POBDDs and the 
appropriate verification techniques. Then, we describe the proposed algorithms 
followed by the experimental results and finally conclusions. 

2 Preliminaries 

The idea of partitioning was used to discuss a function representation scheme 
called partitioned-ROBDDs in [12,11] which was extensively developed in [16]. 

Definition. [16] Given a Boolean function $f : B^n \to B$, defined over $n$ inputs
$X_n = \{x_1, \ldots, x_n\}$, the partitioned-ROBDD (henceforth, POBDD)
representation $\chi_f$ of $f$ is a set of $k$ function pairs,
$\chi_f = \{(w_1, f_1), \ldots, (w_k, f_k)\}$, where $w_i : B^n \to B$ and
$f_i : B^n \to B$ are also defined over $X_n$ and satisfy the following
conditions:

1. $w_i$ and $f_i$ are ROBDDs respecting the variable ordering $\pi_i$, for $1 \le i \le k$.

2. $w_1 \lor w_2 \lor \ldots \lor w_k = 1$

3. $w_i \land w_j = 0$, for $i \ne j$

4. $f_i = w_i \land f$, for $1 \le i \le k$

The set $\{w_1, \ldots, w_k\}$ is denoted by $W$. Each $w_i$ is called a window
function and represents a partition of the Boolean space over which $f$ is
defined. Each partition is represented separately as an ROBDD and can have a
different variable order. Most ROBDD based algorithms can be adapted easily
for POBDDs.
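
The four conditions of the definition can be checked on a toy example. The following is a minimal sketch only: it stores each function as an explicit set of satisfying assignments instead of an ROBDD, so condition 1 (the per-partition variable order) is not modelled, and the example function and windows are assumptions chosen purely for illustration.

```python
from itertools import product

def minterms(fn, n):
    """Return the set of n-bit assignments on which fn is true."""
    return {bits for bits in product((0, 1), repeat=n) if fn(bits)}

n = 3
universe = set(product((0, 1), repeat=n))

# Hypothetical example function f = (x1 AND x2) OR x3.
f = minterms(lambda x: (x[0] and x[1]) or x[2], n)

# Two windows obtained by splitting on x1: w_1 = NOT x1, w_2 = x1.
windows = [minterms(lambda x: x[0] == 0, n),
           minterms(lambda x: x[0] == 1, n)]

# Condition 4: each f_i is the restriction of f to its window.
parts = [(w, w & f) for w in windows]

covers = set().union(*windows) == universe              # condition 2
disjoint = not (windows[0] & windows[1])                # condition 3
recombined = set().union(*(fi for _, fi in parts)) == f # the f_i rebuild f
```

Because the windows are disjoint and cover the space, the pieces $f_i$ can be stored and manipulated independently, which is what the partitioned algorithms in the rest of the paper rely on.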

Partitioned-ROBDDs are canonical and various Boolean operations can be
efficiently performed on them just like ROBDDs. In addition, they can be
exponentially more compact than ROBDDs for certain classes of functions. The
practical utility of this representation is also demonstrated by constructing
ROBDDs for the outputs of combinational circuits [16]. An excellent comparison of
the computational power of various BDD-based representations and
partitioned-ROBDDs may be found in [2].



2.1 Reachability and Model Checking 

We omit the syntax of CTL as it is widely known and readily available in the 
literature. We shall only note that it is possible to express any CTL formula in 
terms of the Boolean connectives of propositional logic and the existential tem- 
poral operators EX, EU and EG. Such a representation is called the existential 
normal form. 
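
For reference, the standard CTL equivalences behind the existential normal form can be written out as follows. These are textbook identities and are not taken from this paper; $\mathit{true}$ denotes the proposition that holds in every state.

```latex
\begin{align*}
EF\,p &\equiv E(\mathit{true}\;U\;p) \\
AX\,p &\equiv \neg EX\,\neg p \\
AG\,p &\equiv \neg EF\,\neg p \equiv \neg E(\mathit{true}\;U\;\neg p) \\
AF\,p &\equiv \neg EG\,\neg p \\
A(p\;U\;q) &\equiv \neg E(\neg q\;U\;(\neg p \land \neg q)) \land \neg EG\,\neg q
\end{align*}
```

Only the rewritings of $AF$ and $A(p\;U\;q)$ introduce $EG$, i.e. a greatest fixpoint.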

Model Checking is usually performed in two stages: In the first stage, the 
finite state machine is reduced with respect to the formula being model checked 
and then the reachable states are computed. The second stage involves computing
the set of states falsifying the given formula. The reachable states computed
earlier are used as a care set in this step.

Since there exist computational procedures for efficiently performing Boolean 
operations on symbolic BDD data structures, including POBDDs, model checking
of CTL formulas is primarily concerned with the symbolic application of the
temporal operators. $EX\,q$ is a backward image and uses the same machinery as
image computation during reachability, with an adjustment for the direction.
$E(p\,U\,q)$ (resp. $EG\,p$) has traditionally been represented as the least
(resp. greatest) fixpoint of the operator $\tau(Z) = q \lor (p \land EX\,Z)$
(resp. $\tau(Z) = p \land EX\,Z$).

Invariants are CTL formulas of the form AGp, where p is a proposition, and 
can therefore be checked during the initial reachability computation itself. 

The standard reachability algorithm is based on a breadth-first traversal of 
finite-state machines [8,13,19]. The algorithm takes as inputs the set of initial 
states, $I(s)$, expressed in terms of the present state variables, $s$, and a
transition relation, $T(s, s', i)$, relating the set of next states, $N(s')$,
that a system can reach from a state $s$ on an input $i$. The transition
relation, $T(s, s', i)$, is obtained by taking a conjunction of the transition
relations, $s'_k = f_k(s, i)$, of the individual state elements, i.e.,
$T(s, s', i) = \bigwedge_k (s'_k = f_k(s, i))$. Given a set of states, $R(s)$,
that the system can reach, the set of next states, $N(s')$, is calculated using
the equation $N(s') = \exists s, i\, [T(s, s', i) \land R(s)]$. This calculation
is also known as image computation. The set of reached states is computed by
adding $N(s)$ (obtained by replacing variables $s'$ with $s$) to $R(s)$ and
iteratively performing the above image computation step until a fixed point is
reached.
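
The breadth-first fixpoint described above can be sketched with explicit sets in place of BDDs. This is an illustration under simplifying assumptions: the transition relation is given as a set of (state, next state) pairs with the inputs already quantified out, and the function names are hypothetical.

```python
def image(states, transitions):
    """N = { s' | there is an s in `states` with (s, s') in `transitions` }."""
    return {t for (s, t) in transitions if s in states}

def reachable(initial, transitions):
    """Iterate image computation from the initial states until a fixed point."""
    reached = set(initial)
    frontier = set(initial)
    while frontier:
        # Only states not seen before can contribute new successors.
        frontier = image(frontier, transitions) - reached
        reached |= frontier
    return reached

# Toy machine: states 0, 1, 2 form a cycle; states 3 and 4 are unreachable from 0.
T = {(0, 1), (1, 2), (2, 0), (3, 4)}
R = reachable({0}, T)
```

A symbolic implementation performs the same iteration, but `image` becomes a conjunction followed by existential quantification on BDDs.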



2.2 Reachability Using POBDDs 

In the context of Partitioned-OBDDs, we can derive a transition relation, $T_{jk}$,
from partition $j$ into partition $k$ by conjoining $T$ with the respective window
functions as $T_{jk}(s, s', i) = w_j(s)\, w_k(s')\, T(s, s', i)$.

    preImgPart(Bdd, j) {
        return preImage(Bdd, T_{jj})
    }

    preImgComm(S) {
        result = 0
        foreach (partition j)
            temp = preImage(S_j, T_{\bar{j}j})
            foreach (partition k != j)
                temp_k = temp restricted to w_k
                reorder BDD temp_k from partition order j to order k
                result_k = result_k OR temp_k
            end for
        end for
        return result
    }



Fig. 1. Image Computation Algorithm 



The Partitioned-ROBDD based traversal algorithm uses the ROBDD based
algorithm in its inner loop to perform a fixed point computation on individual
partitions. Let us assume that we are given a partitioned-ROBDD representation
$\chi_R = \{(w_j(s), R_j(s)) \mid 1 \le j \le k\}$. If we take the image of $R_j$
under $T_{jj}$, we obtain
$N_j(s') = \exists s, i\, [w_j(s)\, w_j(s')\, T(s, s', i)\, R_j(s)]$.
Since $w_j(s')$ is independent of the variables that are to be quantified, it
can be taken out of the existential quantification, giving us
$N_j(s') = w_j(s')\, [\exists s, i\; w_j(s)\, T(s, s', i)\, R_j(s)]$.

The image of $R_j$ under $T_{jj}$ lies completely within partition $j$. Similarly,
the image, $N_l$, of $R_j$ under $T_{jl}$ will lie completely within partition $l$.
This observation motivates us to define the image computation in terms of the
image computed within the same partition and the image communicated to another
partition. The former will be called imgPart and the latter imgComm.
Analogously, we define the pre-image computations preImgPart and preImgComm.
They are illustrated in the pseudo-code shown in Fig. 1.
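
The split of an image into a local part and a communicated part can be illustrated with explicit state sets. This is only a sketch: the window is modelled as a hypothetical function mapping a state to its partition index, and no BDDs or variable reordering are involved.

```python
def image(states, transitions):
    """All successors of `states` under a set of (state, next state) pairs."""
    return {t for (s, t) in transitions if s in states}

def img_part(R_j, j, window, transitions):
    """Successors of R_j that stay inside partition j (the image under T_jj)."""
    return {t for t in image(R_j, transitions) if window(t) == j}

def img_comm(R_j, j, window, transitions):
    """Successors of R_j that leave partition j, grouped by target partition."""
    out = {}
    for t in image(R_j, transitions):
        k = window(t)
        if k != j:
            out.setdefault(k, set()).add(t)
    return out

window = lambda s: s % 2            # hypothetical window: partition index = parity
T = {(0, 2), (0, 1), (2, 3), (1, 3)}
local = img_part({0}, 0, window, T)  # successors of state 0 that remain in partition 0
sent = img_comm({0}, 0, window, T)   # successors of state 0 that go to other partitions
```

The union of `local` and all sets in `sent` is exactly the ordinary image of the input set, which mirrors the decomposition used in the text.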

The pre-image, i.e. computeEX, is then obtained by their union, as
$preImage(p) := \bigvee_i preImgPart(p_i, i) \lor preImgComm(p)$.

The pseudo-code for computeEX, as applied to POBDDs, is in Fig. 2a.

Notice that two approaches are possible for the computation of the
communicated image. In the first, an image is computed from partition $j$ into
each partition $k \ne j$ separately, using the transition relation $T_{jk}$.
Alternatively, one can compute the image from partition $j$ into the Boolean
space that is the complement of partition $j$, denoted by $\bar{j}$. The former
has the advantage that the BDD representations of the transition relations
$T_{jk}$ are much smaller, but in return it has to perform $O(n^2)$ image
computations, where $n$ is the number of partitions. We use the second method in
defining imgComm. This method requires only $O(n)$ image computations, but each
of these is followed by $O(n)$ restrict operations.

3 Improved State Space Traversal

In this section, we will describe the use of a dynamic partitioning scheme where 
the number of partitions can be increased or decreased as the computation pro- 
gresses. This can be shown to be exponentially more succinct than the use of 
a fixed constant number of partitions. We also present a novel algorithm for 
computing a path from a state with an error to the initial state. 

3.1 Dynamic Repartitioning 

Dynamic repartitioning of the state space is triggered whenever the size of any
partition under observation crosses a certain threshold. The partitioning
variables are selected using the history of previously computed windows.
Repartitioning is performed by splitting the given partition by cofactoring the
entire state space on one or more splitting variables until the blow-up has been
ameliorated in every partition created so far. Initially, the partitioning
is done using one splitting variable, chosen from the window history as
described above. At this point, each new partition is checked to see whether the
blow-up has subsided. If not, repartitioning is called again on that partition
until the blow-up has subsided in all partitions.
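
The repeated splitting can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in the paper: states are explicit bit tuples, the number of states in a partition stands in for its BDD size, and the list of splitting variables is assumed to be given rather than derived from the window history.

```python
from itertools import product

def repartition(states, split_vars, threshold):
    """Split `states` on successive variables until every part is small enough.

    Returns a list of (window, part) pairs, where a window is a tuple of
    (variable index, value) literals describing the cofactor.
    """
    pending = [((), states)]
    result = []
    while pending:
        window, part = pending.pop()
        depth = len(window)
        # Stop when the part is small enough or no splitting variable is left.
        if len(part) <= threshold or depth == len(split_vars):
            result.append((window, part))
            continue
        v = split_vars[depth]
        for value in (0, 1):  # the two cofactors with respect to variable v
            sub = {s for s in part if s[v] == value}
            pending.append((window + ((v, value),), sub))
    return result

states = set(product((0, 1), repeat=3))   # all 8 three-bit states
parts = repartition(states, split_vars=[0, 1], threshold=2)
```

In this toy run the first split leaves two parts of four states each, which still exceed the threshold, so each is split again on the second variable.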

Sometimes it is found that the blow-up in the BDD-sizes during an interme- 
diate step of image computation is a temporary phenomenon which eventually 
subsides by the time the image computation is completed. In such a case the in- 
vocation of dynamic global repartitioning of the state space could create a large 
number of partitions, whose BDD-sizes become eventually very small. These 
partitions create an unnecessary amount of computational overhead. Hence, it 
is advantageous to create these partitions locally only for that particular image 
computation and then recombine them before the end of the image computa- 
tion. To create these local partitions, we can cofactor the state space using the 
ordered list of splitting variables that was generated earlier. 

Our algorithm for checking invariants performs successive steps of image 
computation on each Rj under Tjj. Since these steps, imgPart, of image compu- 
tation add states only within the same partition, and since different partitions 
are disjoint, we are guaranteed that the same state is not being visited multiple 
times within different partitions. Once a fixpoint is reached within a partition
$j$, the procedure imgComm is used to communicate the new set of states to every
partition $l$ with $1 \le l \le k$ and $l \ne j$. Whenever new states are
added to the reached state set, we check for a violation of the invariant.
If a failure is detected, we stop and call the error trace mechanism to
retrieve a path from the initial states to an error state. Otherwise, we proceed
with traversing more states until the entire state space is exhausted, at which
point the formula has passed.

3.2 Tracing Erroneous Paths 

In order to obtain a path from an error state $e$ back to an initial state $i$, the
naive idea would be to compute successive preimages beginning with $e$, until
$i$ is reached. After a few steps of computing backward images, one would be
faced again with a rapidly increasing BDD size. In order to avoid this blow-up
in BDD size, we need to be able to isolate a set of candidate predecessors for
the current state so that the next preimage computation does not have to
handle overly large BDDs. In the case of ROBDDs, this is accomplished by keeping
the so-called “onion rings”, i.e., the frontier of states encountered during each
image computation.

In the partitioned setting, the set of possible predecessors may be spread 
across multiple partitions. Thus it is possible to store these frontier states in 
a partitioned manner. Therefore the backward image can be computed with 
respect to only a portion of the frontier states. 

Thus, the image computations need to be recorded in a tree-like data struc- 
ture in order to be able to find the correct subspace for the backward image. 
For each state s in the set of reachable states S, this tree contains the image 
computation when the state s was first added to the reachable set S. The struc- 
ture stores the information required to trace a backward path as follows: For 
each partition of the boolean space, its frontier is defined as the states added to 
this partition by the most recent invocation of imgComm and the subsequent 
imgPart operations. Each such frontier is actually a collection of sets, each
represented as a BDD, whose set union represents the set of all states that have
been reached in this partition for the first time, but have not yet been used for
communication to other partitions. Thus, the number of BDDs in this frontier
can be, in the worst case, $O(M + d_i)$, where $M$ is the number of partitions and
$d_i$ is the depth of the fixpoint in partition $i$. For the entire graph this
can, in the worst case, be $O(M \cdot (M + d_{max}))$.

To retrieve a path from an initial state to a state $s$, we do the following:

1. Obtain the location in the computation tree that contains $s$.

2. Take the predecessor frontier of this location in the tree, and compute a
backward image into this frontier to find one or more predecessor states.

3. Pick one such predecessor state.

4. Repeat steps 2 and 3 on successive states until an initial state is reached.

This gives us the backward path from state $s$ with an error to an initial state.

Advantages of partitioned error trace: Notice that in the case of ROBDDs,
the onion rings can get large in size. An effect of having these large
representations is that image computations get more expensive. As noted before,
ignoring the frontier states and performing a backward reachability is even more
expensive, and in that case the backward path can be longer too.
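
The retrieval steps listed above can be sketched for the simpler, non-partitioned onion-ring scheme. This is an assumption-laden illustration: states are explicit, the rings form a plain list rather than the tree of per-partition frontiers described in the text, and ties between predecessors are broken by taking the smallest state instead of using timing annotations.

```python
def forward_rings(initial, transitions, bad):
    """Breadth-first traversal that records each frontier ("onion ring")."""
    rings = [set(initial)]
    reached = set(initial)
    # Stop as soon as the newest ring is empty or contains an error state.
    while rings[-1] and not (rings[-1] & bad):
        nxt = {t for (s, t) in transitions if s in rings[-1]} - reached
        reached |= nxt
        rings.append(nxt)
    return rings

def error_trace(rings, transitions, bad):
    """Walk back through the rings from an error state to an initial state."""
    hits = rings[-1] & bad
    if not hits:
        return None                    # no error state is reachable
    state = min(hits)                  # pick one error state deterministically
    path = [state]
    for ring in reversed(rings[:-1]):
        # Backward image of the current state, restricted to the previous ring.
        preds = {s for (s, t) in transitions if t == state and s in ring}
        state = min(preds)             # step 3: pick one predecessor
        path.append(state)
    return list(reversed(path))

T = {(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)}
rings = forward_rings({0}, T, bad={4})
trace = error_trace(rings, T, bad={4})
```

Restricting each backward step to one stored ring is what keeps the preimage small; the partitioned variant in the text narrows this further to the frontier of a single partition.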

Observe that partitions can often be asymmetric with respect to the space
and time required for performing image computations on them. Therefore, in
the presence of multiple paths from an error state to the initial states, it would
be advantageous to compute the shortest path in terms of computational effort
rather than the length of the path. In order to do this, we annotate the nodes
of the tree with information about the amount of time the corresponding image
computation required. These annotations can be used as an indicator of how
much time the backward image would take, and thus, in step 3 above, they can
assist in reducing the time spent in finding a practical path back to the
initial states.

4 Model Checking Fixpoint Formulas 

As mentioned in section 2.1, the modalities EX, EU and EG suffice to represent
any CTL formula in existential normal form.

In particular, we note that the deadlock property $AG(p \to EF\,q)$ can be
represented in the “greatest fixpoint free” fragment of CTL. Since invariant
checking and deadlock properties form a large fraction of the formulas that are
of practical interest to designers, we will first look at the least fixpoint
operator $E(p\,U\,q)$. Note that $p$ and $q$ are not restricted to propositions
and can be any CTL formulae.
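
To make the claim about the deadlock property concrete, the rewriting into existential normal form can be spelled out. The steps use only the standard identities $AG\,\varphi \equiv \neg EF\,\neg\varphi$ and $EF\,\varphi \equiv E(\mathit{true}\;U\;\varphi)$, which are textbook facts and are not taken from this paper.

```latex
\begin{align*}
AG(p \to EF\,q) &\equiv \neg EF\,\neg(p \to EF\,q) \\
 &\equiv \neg EF\,(p \land \neg EF\,q) \\
 &\equiv \neg E(\mathit{true}\;U\;(p \land \neg E(\mathit{true}\;U\;q)))
\end{align*}
```

The result contains two nested $EU$ operators and no $EG$, so it can be evaluated with least fixpoints only.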

4.1 Why Communication Is Expensive 

It is important to notice that there are fundamental differences between the 
two image operations, imgPart and imgComm. Observe that imgPart($R_j$) is in
the same partition $j$ as the original BDD $R_j$ and therefore only one partition
needs to be in memory for its computation. On the other hand, imgComm($R_j$)
computes an image into $\bar{j}$, i.e., every partition other than $j$; therefore
it needs to finally access and modify every partition. This gives rise to two
important issues with respect to communication.

Firstly, the reached state set of every partition needs to be accessed. In the 
case of large designs, where the BDDs of even a single partition can run into 
millions of nodes, this usually means accessing stored partitions from the disk. 

Secondly, the BDD variable order of the computed image set must be changed
from the order of the partition to that of each of its target partitions, before 
the new states can be added to the reached set in the target. Again, for large 
designs, reordering a large BDD can be an extremely expensive operation. 

In this context, image computation within a partition, ImgPart, is a rela- 
tively inexpensive operation as compared to communication between partitions, 
ImgComm. Therefore, in the interest of minimising transfer of BDDs from one 
partition to another, we need a new algorithm that would decrease the number 
of invocations of ImgComm whenever possible. 

An associated advantage of performing image computation repeatedly within 
a partition before communicating, is that it allows some errors to be caught much 
earlier. When a formula fails in any partition, it becomes unnecessary to explore 
the other partitions any further. In this manner, it may be possible to locate the 
error by exploring a smaller fraction of the state space than otherwise necessary. 

In the rest of this section, we will present, in the context of POBDDs, the 
improved model checking algorithm designed to take advantage of partitioning. 

4.2 Evaluating the Least Fixpoint E(pUq) 

The classical algorithm for the least fixpoint operator is presented in Figure 2a
in terms of the POBDD data structure.

a) Classical Algorithm

    computeEX(p) {
        R ← p
        forall (partitions j)
            S_j ← preImgPart(R_j, j)
        end for
        S ← S ∨ preImgComm(R)
        output S
    }

    computeEU(p, q) {
        S ← q and S.old ← ∅
        repeat
            S.old ← S
            S ← q ∨ (p ∧ computeEX(S))
        until (S = S.old)
        output S
    }

b) New Algorithm

    computeEU(p, q) {
        S ← q and S.old ← ∅
        repeat
            S.old ← S
            forall (partitions j)
                repeat
                    S_j.old ← S_j
                    S_j ← S_j ∨ (p_j ∧ preImgPart(S_j, j))
                until (S_j = S_j.old)
            end for
            S ← S ∨ (p ∧ preImgComm(S))
        until (S = S.old)
        output S
    }

Fig. 2. Algorithms for E(pUq) using Partitioned-OBDDs



Notice that in the computation of $E(p\,U\,q)$, the preimage computation forms
the bulk of the work performed by the algorithm. As noted in section 4.1, the cost
of performing communication during every preimage is quite large. This penalty
is due to the resources required to transfer BDDs between partitions, to reorder
the BDDs before such a transfer can occur, and to fetch the partitions from
storage so that the new states can be conjoined with $p$ and disjoined with $q$.
Therefore, it is important to postpone the invocation of preImgComm, i.e., to
perform as many image computations as possible locally within each partition
before communication is performed across partitions.



A New Algorithm for E(pUq)

In this section we describe a new algorithm for model checking least fixpoint
CTL formulas and sketch a proof of its correctness. Algorithm 2b for computing
the set $E(p\,U\,q)$ is designed to take advantage of the partitioned nature of the
data structure. Notice that we explore each partition independently of the others
until each reaches a fixpoint individually. Then, we perform the communication
across partitions.

This allows us to keep just one partition in memory at any given time. It 
also greatly reduces the number of communication induced BDD transfers, disk 
accesses and variable reordering calls. 
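
The effect of postponing communication can be seen on a toy example. The sketch below is illustrative only: it uses explicit sets instead of BDDs, models the window as a hypothetical function from states to partition indices, and simply counts outer iterations; it makes no claim about actual BDD or disk costs.

```python
def pre_image(states, transitions):
    """All predecessors of `states` under a set of (state, next state) pairs."""
    return {s for (s, t) in transitions if t in states}

def eu_classical(p, q, transitions):
    """Least fixpoint of Z = q OR (p AND EX Z), one global pre-image per step."""
    S, steps = set(q), 0
    while True:
        new = S | (p & pre_image(S, transitions))
        steps += 1
        if new == S:
            return S, steps
        S = new

def eu_partitioned(p, q, transitions, window):
    """In the style of Fig. 2b: local fixpoints first, then one communication."""
    local = {(s, t) for (s, t) in transitions if window(s) == window(t)}
    cross = transitions - local
    S, rounds = set(q), 0
    while True:
        old = set(S)
        for j in {window(s) for s in S}:
            while True:  # inner fixpoint, confined to partition j
                S_j = {s for s in S if window(s) == j}
                new = p & pre_image(S_j, local)
                if new <= S:
                    break
                S |= new
        S |= p & pre_image(S, cross)  # communication across partitions
        rounds += 1
        if S == old:
            return S, rounds

# Chain 0 -> 1 -> ... -> 5, with q holding only in state 5 and p elsewhere.
T = {(i, i + 1) for i in range(5)}
p, q = set(range(5)), {5}
classical = eu_classical(p, q, T)
partitioned = eu_partitioned(p, q, T, window=lambda s: s // 3)
```

Both procedures return the same state set, but on this chain the classical loop needs six global pre-image steps while the partitioned loop needs three communication rounds, since most of the work happens in the inner, partition-local loops.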

Before proving the correctness of the new algorithm, we define some notation.
Let the set of states $S$ at the end of the $i$-th iteration of the outermost
repeat-until loop in algorithm 2b be represented by $S^i$.

For every state $s \in E(p\,U\,q)$, either $s \models q$ or there exists a
sequence of states $s_0, s_1, \ldots, s_k$ that has the smallest length
$k \ge 0$ such that $s_0 = s$, $s_k \models q$, $\forall i < k : s_i \models p$
and $\forall i < k : s_i \in preImage(s_{i+1})$. Such a sequence of states is
called a witness for the inclusion of $s$ in $E(p\,U\,q)$, and $k$ is its length.

For the sake of convenience, we will use the symbol for a formula to also
represent the set of states it represents. We first show that algorithm 2b
terminates.



Lemma 1. (Termination) For any integer $i$, $S^{i+1} \supseteq S^i$. The
inclusion is strict unless a fixpoint is reached.

The proof is evident from the construction of the sets $S^i$. Since any step of
the procedure must add at least one new state to the set $S$, we have termination
after at most as many iterations as there are states in the space under
consideration.

Theorem 1. The procedure computeEU of algorithm 2b, given the set of states
corresponding to formulas $p$ and $q$ as inputs, terminates with the output $S$
being precisely the set of states that model the formula $E(p\,U\,q)$.

Proof: Soundness: We prove by induction on the sets $S^i$ that the procedure
is sound, i.e., at all times $S \subseteq E(p\,U\,q)$. This clearly holds for any
state in the initial set $S^0 = q$, since any state satisfying $q$ also satisfies
$E(p\,U\,q)$.

Assume it holds for $S^i$, i.e., that $S^i \models E(p\,U\,q)$. Consider a state
$s \in S^{i+1} \setminus S^i$. Then, by construction of $S^{i+1}$ from $S^i$, we
have $s \models p$. Either $s$ is added during some step of the inner fixpoint
loop or it is added in a step of communication, i.e., $s \in preImgComm(S^i)$.

Suppose $s$ is added in the inner fixpoint loop of some partition $j$. Since
$S^i$ is a POBDD, let us call the projection of $S^i$ in partition $j$ as
$S^i_j$. From before, we know
$preImgPart(S^i_j, j) \subseteq preImgPart(S^i) \subseteq preImage(S^i)$. Also
notice that the variable for the inner fixpoint is initialized to $S^i_j$.
Therefore, every state added in the first step of the inner fixpoint models
$p \land EX(E(p\,U\,q))$ and therefore models $E(p\,U\,q)$. Consequently, we can
show by induction that any state added in the inner fixpoint loop for partition
$j$ must model $E(p\,U\,q)$.

In the second case, $s$ was added in some step of the communication.
Considering that $preImgComm(S^i) \subseteq preImage(S^i)$, any state added in the
communication step models $p \land EX(E(p\,U\,q))$, and therefore $E(p\,U\,q)$.
In particular, $s \models E(p\,U\,q)$.

Consequently, $S^{i+1} \setminus S^i \models E(p\,U\,q)$ and the soundness of the
procedure follows by induction.

Completeness: We next show the completeness, i.e., that every state of
$E(p\,U\,q)$ is indeed in the set $S$. Let $T^k$ be the set of states whose
inclusion in $E(p\,U\,q)$ is witnessed by a path of length at most $k$. We prove
by induction on $k$ that $T^k \subseteq S$. In the base case, this trivially
holds because $T^0 = q = S^0 \subseteq S$.

Now, let us assume that $T^k \subseteq S$. For any state $s \in T^{k+1}$ consider
the sequence of states $s_0 = s, s_1, \ldots, s_{k+1}$ that witnesses its
inclusion in $E(p\,U\,q)$. We will show that $s \in S$.

Now, the sequence $s_1, \ldots, s_{k+1}$ is a witness for $s_1$, therefore
$s_1 \in T^k \subseteq S$. In particular, there exists a smallest $j$ so that
$s_1 \in S^j$. We know that $s \models p$ and
$s \in preImage(s_1) \subseteq preImage(S^j)$. From the definition of $S^{j+1}$
and Algorithm 2b, we have that

$$S^{j+1} \supseteq S^j \lor (p \land preImgPart(S^j)) \lor (p \land preImgComm(S^j))$$

$$= S^j \lor (p \land (preImgPart(S^j) \lor preImgComm(S^j)))$$

$$= S^j \lor (p \land preImage(S^j)).$$

Therefore, $s \in S^{j+1} \subseteq S$, whereby $T^{k+1} \subseteq S$. By
induction, this gives us $E(p\,U\,q) \subseteq S$.

Together with lemma 1, this proves that algorithm 2b terminates with the
set $S = E(p\,U\,q)$.



4.3 Evaluating the Greatest Fixpoint EGp 

The model checking of $EG\,p$ is done by computation of the greatest fixpoint
of the operator $\tau(Z) = p \land EX\,Z$. As in the case of the least fixpoint, one would
like to postpone the communication until after each partition has reached its 
individual fixpoint independent of the other partitions. However, the description 
of this is considerably more complex and thus far we have only implemented a 
simple, classical, version of the greatest fixpoint algorithm for EGp in terms of 
POBDDs. 
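
For comparison with the least fixpoint case, the classical greatest fixpoint iteration for $EG\,p$ can be sketched on explicit state sets. This is the textbook non-partitioned iteration, written under the assumption that the transition relation is a set of (state, next state) pairs; it is not the POBDD implementation mentioned above.

```python
def pre_image(states, transitions):
    """All predecessors of `states` under a set of (state, next state) pairs."""
    return {s for (s, t) in transitions if t in states}

def compute_eg(p, transitions):
    """Greatest fixpoint of Z = p AND EX Z, starting from Z = p and shrinking."""
    Z = set(p)
    while True:
        new = p & pre_image(Z, transitions)
        if new == Z:
            return Z
        Z = new

# States 0 and 1 form a cycle, 2 leads into it, 4 can only move to 3,
# and 3 (a self-loop) does not satisfy p.
T = {(0, 1), (1, 0), (2, 0), (3, 3), (4, 3)}
p = {0, 1, 2, 4}
eg = compute_eg(p, T)
```

State 4 is removed in the first iteration because its only successor violates $p$. Unlike the least fixpoint, where states found in one partition stay valid, a state here can be discarded because of a change in another partition, which hints at why postponing communication is harder in this case.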

Even so, most specifications of interest in practice are expressible in the
fragment of CTL free of greatest fixpoints. For example, deadlock avoidance
properties of the form $AG(p \to EF\,q)$ and invariants can both be expressed in
existential normal form using only least fixpoints. Therefore, we find that the
inability to postpone communications for the greatest fixpoint does not impose a
great disadvantage in most practical applications.

5 Experiments 

We implemented dynamic partitioning-based model checking using the CUDD
package [18] (version 2.3.0) for OBDD representation. We use the routines from
VIS [3] (version 1.4) to read in the design and to build the initial transition
relation using the IWLS95 method [17]. Our implementation can be thought of
as building on top of VIS and therefore a comparison with VIS is natural.

We found empirically that for our benchmarks VIS-2.0 using the MLP [14] 
method performs worse than VIS-1.4 using the IWLS95 method, probably due 
to known problems in preimage computation. Thus, we compared our methods 
to VIS by using the IWLS95 method for both. 

Benchmarks and Experimental Setup 

For our experiments, we used the designs from the VIS Verilog benchmark
suite [1]. This suite also contains properties given as CTL formulas for
verification. We pick the properties which, when expressed in existential normal
form, are “greatest fixpoint free”. On the entire benchmark suite this is found
to cover about 80% of all properties, which is believed to be typical. Finally,
we also used proprietary designs that were made available by Fujitsu designers.

Table 1. Invariant Checking on Large Designs

             Number of    Peak Nodes                      Time (seconds)
  Circuit    Partitions   VIS        POBDD     Gain       VIS      POBDD     Gain
  palu        4           371 K      150 K     2.5        253      102       2.5
  product     4           919 K      116 K     7.9        1394     546       2.6
  am2910      4           >1.52 M    187 K     >8.1       >24h     1.2 K     >72
  rotate32    43          >825 K     640 K     >1.3       >24h     8 K       >8.6
  spinner32   60          >1.61 M    362 K     >4.4       >24h     10 K      >11.3
  vsa16a      4           >1.02 M    722 K     >1.4       >24h     22.6 K    >3.8

The parameters of VIS and CUDD are left unchanged at their default values.
Experiments on the public benchmarks were performed on dual-processor Xeon
2.2 GHz workstations with 2 GB of RAM running Linux. The invariant checking
as well as model checking experiments used dynamic partitioning. Both were run
with a timeout limit of 24 hours.

The peak number of live nodes is given under Peak Nodes. The CPU time is
measured in seconds and given under Time. The columns labeled Gain describe
the gain in space (under Peak Nodes) and in time (under Time) of POBDDs over VIS.
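
The Gain columns appear to be the ratio of the VIS figure to the POBDD figure. The short sketch below recomputes a few entries of Table 1 under that assumption; the formula is inferred from the tabulated values and is not stated explicitly in the text. When VIS times out, the 24-hour limit gives only a lower bound on the gain.

```python
def gain(vis, pobdd):
    """Ratio of the VIS figure to the POBDD figure, rounded to one decimal."""
    return round(vis / pobdd, 1)

DAY = 24 * 60 * 60  # the 24-hour timeout, in seconds

# (VIS, POBDD) pairs copied from Table 1.
rows = {
    "palu peak nodes (K)": (371, 150),
    "palu time (s)": (253, 102),
    "product time (s)": (1394, 546),
    "am2910 time (s), lower bound": (DAY, 1200),
}
gains = {name: gain(v, p) for name, (v, p) in rows.items()}
```

The recomputed values agree with the corresponding Gain entries in the table.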

Results on Invariant Checking. We compare our POBDD method to the 
standard VIS approach on invariant checking in Table 1. Note that this table is 
restricted to the largest entries (BDD-nodes > 300K) in the benchmark suite. 
Our partitioned approach clearly outperforms the state-of-the-art VIS in time 
as well as in space. Especially for the larger circuits the improvement is drastic, 
since we complete the verification of four circuits that timed out using VIS. 

Comparison with Static Partitioning. It is natural to analyse what benefit
dynamic partitioning offers over static partitioning. In Fig. 3, we compare the
performance of the proposed dynamic partitioning-based invariant checking
approach with invariant checking based on the static partitioning method of [15].
In particular, note that in the last case, vcrc32-8, the previous approach timed
out after 86,400 seconds whereas we are able to complete in about 12,000 seconds.

Results on Model Checking. The results on runtime and space requirements 
in model checking are presented in Table 2. 

POBDDs may not always show their full potential on the smaller circuits
due to the overhead of creating and maintaining partitions. Nevertheless, the
results show that POBDD-based model checking can outperform VIS even in
such cases, in time as well as in space.

More important, however, are the last few entries in the table, showing the
harder benchmarks. Here, POBDD-based model checking clearly outperforms the
classical approach and is even able to finish four of the designs that cannot be
finished within the given 24-hour timeout when using VIS.

Fig. 3. Comparison of Times taken (Normalized) by Different Partitioning Approaches
for Invariant Checking on some Large designs

Table 2. Model Checking on Large Designs

             Number of    Peak Nodes                      Time (seconds)
  Circuit    Partitions   VIS        POBDD     Gain       VIS      POBDD     Gain
  product     4           919 K      108 K     8.5        1450     437       3.3
  s1269b      4           2.3 M      317 K     7.2        7340     170       43.0
  am2910      4           >4.9 M     127 K     >38.2      >24h     324       >266
  twoQ        6           >5.5 M     1.8 M     >3.1       >24h     11.4 K    >7.6
  palu        12          >10.5 M    3 M       >3.5       >24h     40.1 K    >2.2
  am2901      5           >5.7 M     1.94 M    >2.9       >24h     45.4 K    >1.9

It is also noteworthy that the maximum peak BDD size of one partition
is often an order of magnitude smaller than the maximum peak node size for
ROBDDs. We have observed that this reduction factor is in many cases larger than
the number of partitions created.



Industrial Circuits. The properties for industrial circuits were taken from
actual Fujitsu designs with sizes ranging from 2,000 to 10,000 flip-flops. Table 3
shows the summarized results for the comparison of POBDD-based model
checking with VIS for three different types of properties. For the first two
property types, Index out of range and Full Case, the POBDD method is able to
finish 11 (resp. 5) more properties than the OBDD method.

Table 3. Model Checking of Industrial Circuits (2,000 to 10,000 flip-flops)

  Property Type                 Method   Pass   Fail   Timeout
  Index out of range            POBDD    678    0      0
                                VIS      667    0      11
  Full Case                     POBDD    16     0      0
                                VIS      11     0      5
  Synchronizer data stability   POBDD    2      4      0
                                VIS      0      2      4

For the third property type, Synchronizer data stability, the POBDD method
finishes all six properties (2 passes and 4 failures), whereas the OBDD approach
detects only 2 of the failures and times out on the remaining 4 properties.

6 Conclusions 

In this paper we addressed the memory explosion problem associated with model 
checking through the use of dynamically Partitioned-OBDDs. We have shown 
that this approach can be significantly better on problems where the state of 
the art requires impractically large computational resources. The main 
advantage of the proposed verification technique is its ability to control the 
memory required. Usually, this also improves run time, which is primarily 
governed by the BDD sizes. On large circuits we find that the computational 
savings offered by the proposed partitioning-based model checking can be 
significant. We have shown cases where our proposed method finished in just a 
few thousand seconds, whereas other approaches timed out after a day. 
Importantly, a new algorithm for invariant checking and for model checking 
the fragment of CTL free of greatest fixpoints in existential normal form is 
presented. It can handle many more properties of practical interest and truly 
exploit the theoretical and practical benefits of dynamically Partitioned-OBDDs. 
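
To make the idea of partition-wise invariant checking concrete, here is a minimal sketch. It is not the algorithm of this paper: explicit Python sets stand in for OBDDs, and the function name `check_invariant` and its parameters (`successors`, `window`, `is_bad`) are hypothetical. What it illustrates is the general scheme of running a least fixpoint inside one partition at a time and handing states that leave the partition over to their owner.

```python
from collections import defaultdict


def check_invariant(init, successors, window, is_bad):
    """Partition-wise reachability with an invariant check (illustrative only).

    init       -- iterable of initial states
    successors -- function: state -> iterable of next states
    window     -- hypothetical window function: state -> partition id
    is_bad     -- function: state -> True if the invariant is violated
    Returns True if no reachable state is bad, False otherwise.
    """
    reached = defaultdict(set)   # partition id -> states reached so far
    pending = defaultdict(set)   # partition id -> states not yet expanded
    for s in init:
        reached[window(s)].add(s)
        pending[window(s)].add(s)

    while any(pending.values()):
        # Work on a single partition, so only its states are "in memory".
        p = next(k for k, v in pending.items() if v)
        frontier = pending[p]
        pending[p] = set()
        # Local least fixpoint: expand states that stay inside partition p.
        while frontier:
            if any(is_bad(s) for s in frontier):
                return False
            new_local = set()
            for s in frontier:
                for t in successors(s):
                    q = window(t)
                    if t in reached[q]:
                        continue
                    reached[q].add(t)
                    if q == p:
                        new_local.add(t)
                    else:
                        pending[q].add(t)  # communicate t to partition q
            frontier = new_local
    return True


if __name__ == "__main__":
    # Toy example: a counter modulo 8, split into two partitions by its top bit.
    step = lambda s: [(s + 2) % 8]
    top_bit = lambda s: s // 4
    print(check_invariant([0], step, top_bit, lambda s: s == 3))
    print(check_invariant([0], step, top_bit, lambda s: s == 6))
```

In a symbolic implementation each set would be a BDD restricted to one window, so the peak size is bounded by the largest single partition rather than by the whole reached set.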



Acknowledgment. The authors would like to thank Prof. E. Allen Emerson 
and Prof. David Dill for their advice and encouragement. 



References 

1. VIS Verilog benchmarks. http://vlsi.colorado.edu/~vis/. Technical report. 

2. B. Bollig and I. Wegener. Partitioned BDDs vs. other BDD models. In Proc. of the 
Intl. Workshop on Logic Synthesis, 1997. 

3. R. K. Brayton, G. D. Hachtel, A. L. Sangiovanni-Vincentelli, F. Somenzi, A. Aziz, 
S. Cheng, S. A. Edwards, S. P. Khatri, Y. Kukimoto, A. Pardo, S. Qadeer, R. K. 
Ranjan, S. Sarwary, T. R. Shiple, G. Swamy, and T. Villa. VIS: A System for 
Verification and Synthesis. In Computer Aided Verification, 1996. 

4. R. E. Bryant. Graph based algorithms for Boolean function representation. IEEE 
Transactions on Computers, C-35:677-690, August 1986. 





5. J. R. Burch, E. M. Clarke, and D. E. Long. Symbolic Model Checking with 
Partitioned Transition Relations. In Proc. of the Design Automation Conf., pages 
403-407, June 1991. 

6. E.M. Clarke and E.A. Emerson. Design and synthesis of synchronization skeletons 
using branching time temporal logic. In Proc. IBM Workshop on Logics of Pro- 
grams, volume 131 of Lecture Notes in Computer Science, pages 52-71. Springer- 
Verlag, 1981. 

7. E.M. Clarke, E.A. Emerson, and A.P. Sistla. Automatic verification of finite state 
concurrent systems using temporal logic specifications. ACM Transactions on 
Programming Languages and Systems, 8:244-263, 1986. 

8. O. Coudert, C. Berthet, and J. C. Madre. Verification of Sequential Machines 
Based on Symbolic Execution. In Proc. of the Workshop on Automatic Verification 
Methods for Finite State Systems, Grenoble, France, 1989. 

9. Orna Grumberg, Tamir Heyman, and Assaf Schuster. Distributed symbolic model 
checking for μ-calculus. In Computer Aided Verification, pages 350-362, 2001. 

10. Tamir Heyman, Daniel Geist, Orna Grumberg, and Assaf Schuster. Achieving 
scalability in parallel reachability analysis of very large circuits. In Computer 
Aided Verification, pages 20-35, 2000. 

11. J. Jain. On analysis of Boolean functions. Ph.D. Dissertation, Dept. of Electrical 
and Computer Engineering, The University of Texas at Austin, 1993. 

12. J. Jain, J. Bitner, D. S. Fussell, and J. A. Abraham. Functional partitioning for 
verification and related problems. Brown/MIT VLSI Conference, March 1992. 

13. Kenneth L. McMillan. Symbolic Model Checking. Kluwer Academic Publishers, 
1993. 

14. I. Moon, G. D. Hachtel, and F. Somenzi. Border-Block Triangular Form and 
Conjunction Schedule in Image Computation. In Proc. of Formal Methods in CAD 
(FMCAD’00), volume 1954 of Lecture Notes in Computer Science, 2000. 

15. A. Narayan, A. Isles, J. Jain, R. Brayton, and A. Sangiovanni-Vincentelli. 
Reachability Analysis Using Partitioned-ROBDDs. In Proc. of the Intl. Conf. on 
Computer-Aided Design, pages 388-393, 1997. 

16. A. Narayan, J. Jain, M. Fujita, and A. L. Sangiovanni-Vincentelli. 
Partitioned-ROBDDs - A Compact, Canonical and Efficiently Manipulable Representation 
for Boolean Functions. In Proc. of the Intl. Conf. on Computer-Aided Design, pages 
547-554, 1996. 

17. R. K. Ranjan, A. Aziz, R. K. Brayton, C. Pixley, and B. Plessier. Efficient BDD 
Algorithms for Synthesizing and Verifying Finite State Machines. In Proc. of the 
Intl. Workshop on Logic Synthesis, 1995. 

18. Fabio Somenzi. CUDD: CU Decision Diagram Package 
ftp://vlsi.colorado.edu/pub. Technical report. 

19. H. J. Touati, H. Savoj, B. Lin, R. K. Brayton, and A. L. Sangiovanni-Vincentelli. 
Implicit State Enumeration of Finite State Machines using BDD’s. In Proc. of the 
Intl. Conf. on Computer-Aided Design, pages 130-133, November 1990. 




Author Index 



Aagaard, Mark D. 66 
Abu-Haimed, Husam 158 
Al Sammane, Ghiath 150 
Ashar, Pranav 334 

Barner, Sharon 35 
Beer, Ilan 141 
Berezin, Sergey 158 
Berger, Eli 141 
Beringer, Lennart 270 
Beyer, Sven 51 
Borrione, Dominique 150 
Bryant, Randal E. 348 

Casas, Jeremy 170 
Chaki, Sagar 19 
Chockler, Hana 111 
Clarke, Edmund 19 

Della Penna, Giuseppe 277, 394 
Dill, David L. 158 

Emerson, E. Allen 216, 247 
Encrenaz, Emmanuelle 164 

Fisler, Kathi 185 

Ganai, Malay K 334 
Geist, Daniel 3 
Gopalakrishnan, Ganesh 81 
Gordon, Mike 200 
Groce, Alex 19 
Gupta, Aarti 334 
Gurumurthy, Sankar 96 

Hooman, Jozef 231 
Hu, Alan J. 170 
Hunt, Warren A. 319 
Hurd, Joe 200 
Hymans, Charles 263 

Intrigila, Benedetto 277, 394 
Iyer, Subramanian 410 



Jacobi, Chris 51 
Jain, Jawahar 410 

Kahlon, Vineet 247 
Kroning, Daniel 51 
Krug, Robert Bellarmine 319 
Kupferman, Orna 96, 111 

Lahiri, Shuvendu K. 348 
Langberg, Michael 363 
Layouni, Mohamed 231 
Leinenbach, Dirk 51 
Lindstrom, Gary 81 

Manolios, Panagiotis 304 
Matusevich, Mark 141 
Melatti, Igor 277, 394 
Moore, J Strother 289, 319 

Narayan, Amit 410 

Ostier, Pierre 150 

Pastor, Enric 378 
Paul, Wolfgang J. 51 
Pena, Marco A. 378 
Pnueli, Amir 363 

Rabinovitz, Ishai 35 
Rodeh, Yoav 363 
Roesner, Wolfgang 1 
Roux, Cedric 164 

Sahoo, Debashis 410 
Schmaltz, Julien 150 
Sebastiani, Roberto 126 
Seshia, Sanjit A. 348 
Sheeran, Mary 4 
Singh, Satnam 283 
Slind, Konrad 81, 200 
Somenzi, Fabio 2, 96 
Stangier, Christian 410 
Strichman, Ofer 19 







Tahar, Sofiene 231 
Toma, Diana 150 
Tonetta, Stefano 126 
Tronci, Enrico 277, 394 
Tzoref, Rachel 141 

Vardi, Moshe Y. 96, 111 



Venturini Zilli, Marisa 277, 394 

Wahl, Thomas 216 

Yang, Jin 170 
Yang, Yue 81 
Yang, Zijiang 334 




